Multivariate polynomial interpolation is a key computation in many areas of science and engineering and, in our case, is crucial for the solution of the reverse engineering of genetic networks modeled by finite fields. Faster implementations of such algorithms are needed to cope with the increasing quantity and complexity of genetic data. We present a new algorithm based on Lagrange interpolation for multivariate polynomials that not only identifies redundant variables in the data and generates polynomials containing only nonredundant variables, but also computes exclusively on a reduced data set. Implementation of this algorithm to FPGA led us to identify a systolic array-based architecture useful for performing three interpolation subtasks: Boolean cover, distinctness, and polynomial addition. We present a generalization of these tasks that simplifies their mapping to the systolic array, and control and storage considerations to guarantee correct results for input sequences longer than the array. The subtasks were modeled and implemented to FPGA using the proposed architecture, then used as building blocks to implement the rest of the algorithm. Speedups up to 172× and 67× were obtained for the subtasks and complete application, respectively, when compared to a software implementation, while achieving moderate resource utilization.
Introduction
Recent years have seen a significant increase in methods and tools to collect genetic data from which important information can be extracted using a number of techniques [1] . For instance, microarray data collected at various steps in organism development can help geneticists understand its developmental process and response to environmental stimuli [2] . Several models such as ordinary differential equation models [3] , continuous models [4] , stochastic models [5] , and discrete models, of which Boolean models [6] are special cases, have been proposed. Essentially, each node represents the expression level of a gene (or genes) of interest at a time point. A directed edge from node a to node b symbolizes the influence of gene(s) in a to the gene(s) in b.
Our research group focuses on multivariate finite field gene network (MFFGN) models, in which multiple genes are monitored at each time step and their expression levels are discretized to a predefined set of values {0, 1, 2, . . . , p − 1}, where p is a prime number, that is, the expression level of each gene is an element of the prime field Z p [7, 8] . One way to deduce gene interaction from these models is to solve the reverse engineering problem, that is, find functions that describe the network's state transitions. An important step in the solution of this problem is determining an interpolating polynomial for the points by using such methods as a multivariate version of Lagrange's interpolation formula [8] . Polynomial interpolation over finite fields also has applications in error correcting codes and is a major building block of many numerical methods [9, 10] .
Fast algorithms and implementations are needed to sustain rapidly increasing bioinformatics computational demands [11] . Reconfigurable logic represents an attractive platform for the high performance implementation of finite field algorithms, with many designs achieving significant speedups over their software equivalents. However, despite the availability of tools to ease the hardware design flow, there is still a notable gap between the bioinformatics and 2 International Journal of Reconfigurable Computing hardware experts [12] . Our group developed a reconfigurable logic implementation of a multivariate polynomial interpolation algorithm suitable for reverse engineering in genetic networks. Rather than striving for a specialized punctual design, we used this opportunity to identify and develop common structures that might be of use to other researchers and applications. Well-documented and parameterizable components that perform functionalities that are common in many algorithms will alleviate the effort spent on new implementations.
In this paper, we discuss the interpolation algorithm and our proposed architecture. We emphasize several computational substructures that appear repeatedly throughout the design and which can be implemented effectively using a hardware systolic array-based architecture. These tasks are of the type where we perform a certain reduction or rearrangement of a sequence of elements from multivariate polynomials or boolean expressions. The systolic array concurrently manages data receipt and parallel processing, making it an amenable structure when dealing with streamed data. The simplicity of the array cells, storage, and control unit, allows the instantiation of multiple cells while maintaining competitive clock frequencies, thus achieving high performance. Several tasks critical to interpolation were modeled and implemented to FPGA using the proposed architecture, obtaining speedups up to 172× when compared to a software implementation, while achieving low resource utilization. These implementations were used as components to develop a complete, high-performance multivariate polynomial interpolation methodology on hardware.
Paper Outline. Section 2 discusses the interpolation methodology, while Section 3 details data representation for the implementation. Section 4 describes the hardware implementation in general. Sections 5 and 6 describe an algorithmic generalization for the reduction tasks and present the proposed hardware architecture. Section 7 explains the rest of the implementation blocks. Section 8 reports the results of FPGA implementations, both of reduction tasks and the complete implementation. Section 9 provides our conclusions.
Interpolation Methodology
Discrete models, in particular finite field models, have been proposed for regulatory processes such as genetic networks [7] and biochemical networks [13] . In the setting of a finite field model a genetic network can be seen as a finite system whose states are governed by the iterations of a tuple of polynomials (P 1 , P 2 , . . . , P n ), where n is the number of genes.
Given a sequence of k n-tuples of values from a set F, representing the states of n genes at times t 0 through t k−1 , the reverse engineering problem seeks to find a function f :
When F is a finite field, function f can be represented by a tuple of polynomials (P 1 , P 2 , . . . , P n ) where P i : F n → F for each i = 1, 2, . . . , n. As pointed out in [8] such 0  1  2  0  1  1  2  0  1  2  0  1  0  2  1  2  1  1  0  2 polynomials can be found by the following variation of Lagrange's interpolation formula: 3 and cannot be further simplified by any algebraic means. However, the polynomial P = 2x 2 + 2x 2 x 3 also interpolates f at the given points but depends only on variables x 2 and x 3 . In this context, x 1 is a redundant variable, since we found another interpolating polynomial for f that does not include x 1 in its expression.
Redundant variables are undesirable as they introduce complexity into the polynomials without adding information valuable to the biologist. Furthermore, empirical data suggest that genetic networks are sparsely connected [14] . Redundant variables can be identified using Sasao's algorithm for the detection of dependent variables in incompletely specified multiple valued functions [15] . 0  2  1  2  3  1  0  2  1  3  1  1  2  1  1  0  1  1  0  1  1  2  0  2  2  1  2  1  0  2  2  1  2  0  1  2  2  1  2  2  1  2  0  1  2  1  1  3  2  2  2  2  2  3 We say that a variable (2) there is no other set of variables that properly contains X with this property.
Essential Variables and
x i is essential in f or depends on f if there exist c, d ∈ D such that c i / = d i , c j = d j for all j / = i, and f (c) / = f (d). A set of variables {x i1 , x i2 , . . . , x im } is a basis for f if (1) for every c, d ∈ D, proj X c = proj X d if and only if f (c) = f (d) and
Lemma 1. If x is an essential variable and X is a basis, then
Proof. Let x be an essential variable. Then there exists c, d ∈ D and i such that
For any function f : D → B and any basis X, f can be expressed solely in terms of the variables of X. For any basis X, any variable x / ∈ X is redundant (with respect to X.) Our goal is to determine interpolating polynomials over finite fields in terms of the variables of any of its bases. To this end, let us first review an algorithm of Sasao [15] which determines all bases for any multiple valued function, not just those over finite fields.
Let f :
The set of variables appearing in each disjunct of R constitutes a basis.
Thus, the algorithm consists of two stages. First, determine R, which requires O(b 2 ) steps and second, convert R to disjunctive normal form, which requires O(2 n ) steps.
The following lemma, whose proof is immediate, is useful in simplifying expressions such as x 1 ∧ x 1 x 2 = x 1 and
Lemma 2. Let S = {x i1 , x i2 , . . . , x id } be a subset of the variables {x 1 , x 2 , . . . , x n }, let E be the disjunction of the variables in S and let F be a disjunction of some subset of S. Also let G be the conjunction of some subset of S. Then
In what follows we abbreviate E ∧ F by EF. Table 2 .
Applying Algorithm 1 and the reduction rules of Lemma 2, we obtain
Thus, the bases are
Interpolating Polynomials with No Redundant Variables.
Given any partially defined function f over a finite field and any basis X for f , we would like to determine interpolating polynomials involving only those variables of X, that is, polynomials containing no redundant variables. 
Then an interpolating polynomial for f whose only variables are those of X is
where jm is the smallest index such that x jm ∈ X and a j and a m differ. 
Equation (3) gives us a new algorithm for polynomial interpolation that involves only those variables in the chosen basis X, and furthermore, it also avoids redundant tuples by computing only on the representatives of the equivalence classes given by the relation ≡ X .
Using (3), O(k 2 ) operations are needed to interpolate k points for each function f : F n → F. Given k points of a function f : F n → F n , the bases for the component functions f i need not be the same. In this case, we can apply (3) n times, thus giving an algorithm with complexity O(nk 2 ).
Example 3. Let us assume that the 0, 1, 2, 3 of Example 2 represent the corresponding elements of Z 5 . We found that the bases are X 1 = {x 1 , x 2 , x 3 }, X 2 = {x 1 , x 3 , x 5 }, and X 3 = {x 3 , x 4 , x 5 }. Now let us use the new algorithm given by (3) to find an interpolating polynomial over Z 5 in terms of the basis X 1 . The equivalence classes under X 1 are
and the set of class representatives is
Applying (3) and making use of (6), we have
Thus, a polynomial which depends only on the variables x 1 , x 2 , and x 3 and interpolates the tuples given by Example 2 is
(10)
Data Representation
As can be deduced from the previous section, an implementation of the multivariate polynomial interpolation algorithm must be able to represent and compute on disjunctive normal form (DNF), conjunctive normal form (CNF), and multivariate polynomial expressions. The following subsections establish our method of representation, which ultimately determines the techniques used for implementing the algorithms.
DNF and CNF Expressions. A Boolean expression D in DNF using only uncomplemented variables is a disjunction of conjunctive clauses
We shall refer to each c i term as a cube. For example, the following is a Boolean expression in DNF: x 1 x 2 ∨ x 1 x 3 ∨ x 0 . In our scheme, a DNF expression is represented as a set of cubes {c 0 , c 1 , . . . , c q−1 } where each c i is represented as a sequence t i0 , t i1 , . . . , t im−1 . Observe that complemented variables are not used in the algorithm for finding essential variables and bases, hence their representation is not necessary in this context. is represented as a sequence t i0 , t i1 , . . . , t im−1 . Within our implementation there is no need to distinguish the CNF from the DNF representation since no individual component needs to operate on both simultaneously. 
General Description of Implementation
Our algorithm for determining interpolating polynomials that do not contain any redundant variables consists of two stages. The first stage identifies the dependent variables using Sasao's algorithm, while the second uses this information along with (3) to determine P i . Thus, a key part of the reconfigurable logic implementation is choosing efficient hardware structures to perform the critical parts of both stages. For the first stage, the most time-consuming task is the conversion of a Boolean expression in conjunctive normal form (CNF) to disjunctive normal form (DNF). The second step relies heavily on the identification of distinct binomials as they are generated from (3) and multivariate polynomial multiplication/addition.
To take advantage of the FPGA's fine-grained parallelism and considering their limitations in I/O bandwidth, these computations were implemented in a pipelined manner. In other words, computational blocks were designed to sustain stream processing as allowed by the FPGA's resources, rather than accumulate all needed data and then perform block processing. Figure 1 shows a block diagram of the implementation. The CNF Generator receives the n-tuples and performs Algorithm 1 to generate the disjuntions of literals, that is, the r(i, j) terms. These are streamed to the Cover DOLs block which eliminates redundant disjuntions of literals by using Lemma 2. The nonredundant DOLs are then fed to block CNF2DNF which computes the conjunctions of literals and uses Lemma 2 to eliminate redundant conjunctions. The output of CNF2DNF is the set of bases. A priori biological knowledge about the organism under study is required to choose the most adequate base. The n-tuples are also fed to the second stage which uses the chosen base to determine representatives of the equivalence classes from the n-tuples. The representatives are then streamed to the modified Lagrange interpolation block which implements (3) by generating the binomials for each pair of class representatives (Binomial Generator), eliminates duplicate binomials (Filter Duplicates), then multiplies and adds the product results.
The highlighted blocks in Figure 1 , that is, Cover DOLs, Cover COLs, Determine Class Representatives, and Polynomial Addition, have two characteristics that make them suitable candidates for systolic array processing: (1) each performs an operation on a sequence of elements that are streamed from the preceeding block, (2) the operation is such that it performs reduction of a sequence of elements from multivariate polynomials or Boolean expressions. For example, the Cover Cubes receives preliminary results from the CNF to DNF conversion and eliminates redundant cubes using Lemma 2. This is an operation that can be described as given a sequence C of cubes (C = c 1 , c 2 , . . . , c m ) determine a subsequence C ⊂ C such that it contains only cubes c i ∈ C where there exists no other c j / = c i ∈ C such that c i ∧ c j = c j . The following sections describe a generalized algorithm and systolic array-based architecture for the type of reduction operations common throughout our interpolation methodology. This is followed by a description of how they are utilized as part of the architectural data path.
Generalization of Reduction Tasks
The subtasks that deal with reduction in our interpolation algorithm can be generally described as follows. Given the sequence X = x 1 , x 2 , . . . , x m of elements from Σ, compute the sequence Y = y 1 , y 2 , . . . , y m where y i ∈ Σ ε = Σ ∪ ε by eliminating or combining elements based on binary comparisons between them. The empty element (ε) is introduced for the representation of operations where groups of elements can be reduced to a single element. Algorithm 2 shows a general form of these problems, where W and S are binary functions of the form Σ ε × Σ ε → Σ ε , and the symbol denotes parallel computation.
The tasks of distinctness, polynomial addition, and redundant Boolean term elimination (hereon referred to as cover) can be implemented by specifying W and S, as explained in the following subsections.
Distinctness.
The task of identifying the distinct elements of a sequence X = x 1 , x 2 , . . . , x m is needed within our interpolation algorithm to determine class representatives and filter out binomial duplicates. It can be implemented by Algorithm 2 with 
n-tuples
Chosen base A-priori biological knowledge We provide a correctness proof for this operation. Proofs of the other reduction operations given in this section are similar.
Lemma 4.
For an input sequence x = x 1 , x 2 , . . . , x m , where x i ∈ Σ, the output of Algorithm 2 with functions W and S defined as in (11) outputs x i1 , x i2 , . . . , x ik , ε, ε, . . . , ε where x i1 , x i2 , . . . , x ik is the subsequence of distinct elements of the input sequence.
Proof. We show by induction that after j ≤ k iterations of the outer loop,
where for each p ≥ j + 1,y p ∈ {x ij+1 , . . . , x ik } ∪ {ε}.
After j = 1 iterations, x 1 = x i1 . Next suppose that after j iterations,
for some j < k where y p ∈ {x ij+1 , . . . , x ik } ∪ {ε} for each p ≥ j + 1. Then either (1) y j+1 = x j+1 or (2) y j+1 = ε. In the first case, y j+1 = x ij+1 since x j+1 is not equal to any x p ,p ≤ j and is not changed by the j +1st iteration. In the second case, y j+1 = ε will be replaced by x ij+1 during the j + 1st iteration. Hence in either case,
where for each p ≥ j+2,y p is either ε or an element x ip , where p ≥ j + 2. Hence after i = k iterations,
and remains the same after i > k iterations since W(ε, ε) = S(ε, ε) = ε.
Polynomial Addition.
Assume that each of the elements in X = x 1 , x 2 , . . . , x m is a monomial. A polynomial addition operation can be implemented by Algorithm 2 with 
where q i ⊃ ε for any q i .
After computation with Algorithm 2 using (17), the result is X = x 1 , x 2 x 4 , x 1 , ε. Notice that there may be duplicate terms in the result but these can be easily eliminated using a distinctness operation.
Algorithm 2 is also amenable to other problems that can be expressed as the result of binary comparisons between every two elements of a sequence. For example, the problem of sorting a sequence S can be implemented with
Hardware Structure for Reduction Tasks
This section explains our proposed hardware structure and outlines the mapping for the reduction tasks. We first discuss the systolic array and cells with the assumption that the array is deep enough to process the input sequence without overflowing. We proceed by discussing the additional components that must be added to guarantee correct processing even if the overflow condition does not hold.
Systolic Array and Cells.
The proposed structure is a linear systolic array of n computational cells, each performing an operation deduced from W and S of the above discussion. Figure 2 illustrates the array and contents of the basic cell. Each cell contains two registers F and M, where each has enough resources to hold any single element from the alphabet Σ ε . Each cell also contains the logic to perform functions W(F, M) and S(F, M). All cells are initialized to the empty symbol. At each clock step a new element from the sequence X is input to the first cell and every cell performs a comparison between its stored value and the value from the previous cell and assigns (possibly) new values to the registers according to W and S. After all elements have been processed the M registers contain the resulting sequence and can be shifted out of the array using the connections between the M registers.
More formally, our linear systolic array computation can be modeled as a simplified version of the generic systolic
W(F, M) S(F, M) W(F, M) S(F, M) W(F, M) S(F, M)
F [1] M [1] F [2] M [2] 
Algorithm 3: Model of the linear systolic array computation. Figure 3 .
array presented in [16] , as shown in Algorithm 3. The notation F i [r] symbolizes the content of register F in cell r at iteration i. Figure 3 illustrates the operation of the array using cells that implement the distinctness task by defining W and S as in (11) . Observe that the operations implied by W and S and the connections between neighboring cells accomplish the comparisons established in Algorithm 2, albeit in a different order and in a parallel manner. In fact, the computation implemented by the systolic array can be interpreted as a reindexing of Algorithm 2 as presented in Algorithm 4. Table 3 shows the comparison indexes generated by Algorithm 4 for the example in Figure 3 . Notice that even though Algorithm 4 has reordered the indexes, the same data dependence is maintained as in Algorithm 2. In other words, if f (x i , x j ) is performed before a f (x i , x k ) (where j / = k) in Algorithm 2 the same order is preserved in Algorithm 4. The same condition holds for the order of f (x i , x j ) and f (x k , x j ).
Processing Sequences Longer than the Array Depth.
The functionality of the basic cells mandates their implementation on user logic inside the FPGA (rather than on embedded units, such as block RAMS). Depending on the application, the characteristics of the data sequence, and FPGA model, the FPGA might not have enough resources to instantiate a systolic array whose depth d sa is greater than or equal to m (the sequence length). In such cases, there is no guarantee that the array by itself will be able to compare each element against each other to obtain the correct solution. To illustrate this anomaly, assume that in the example presented in Figure 3 the input is a sequence of distinct elements X = x 0 , x 1 , . . . , x 5 . The array would be filled once x 3 is registered in the last cell, hence none of the cells would have operated on the combination of x 4 and x 5 . A control and temporary storage mechanism must be provided to reiterate the data through the pipeline if needed and guarantee that all elements are compared against each other. This can be accomplished by a scheme such as the one shown in Figure 4 through the addition of a FIFO, control unit and several datapath control elements (i.e., multiplexers).
The Overflow FIFO (OF) stores elements that make it through the array without having been eliminated or registered. These elements will have to be reiterated through the array to test their relation to other elements in the OF. reason why one may be able to increase FIFO depth and not necessarily array depth is that in FPGAs FIFOs are easily mapped to the embedded block RAM, whereas processing cells map to user logic.
The control unit controls the data flow between the array and the OF and determines how to reiterate the overflowed data through the array. For the reduction tasks presented in Section 5 we can deduce three basic modes of operation for the control unit. Figures 5 and 6 illustrate the need for these modes. Figure 5 shows the end of a first pass of a sequence through systolic arrays that implement (a) a sorting algorithm, and (b) sum of polynomials. 
To results memory
To results memory (i) In the case of a nonreducing operation like sorting ( Figure 5(a) ), the array will contain d sa sorted elements. Thus, elements in the array can be shifted out while elements in the OF are reinputted into the array. This must be repeated until the OF is empty after a pass, that is, at most m/d sa times.
(ii) When adding polynomials ( Figure 5(b) ), the array will contain at most d sa monomials with their final coefficients. Thus, elements in the array can be shifted out while elements in the OF are reinputted into the array. This must be repeated until the OF is empty after a pass, that is, at most m/d sa times. Figure 6 illustrates the various passes needed for the cover operation. At the end of a first pass the array will contain at most d sa cubes. However, in this case a first pass through the array does not guarantee that all cubes in the OF have been compared to all the cubes registered in the array. For example, in Figure 6 (a) cube x 1 x 6 passed through the array without being registered. Meanwhile, the last cube x 1 covered x 1 x 3 and x 1 x 2 in two cells of the array. Thus, we have at least one cube (x 1 x 6 ) in the OF which was not compared to a cube in the array (x 1 ). The elements in the OF must be refed through the (unflushed) array to allow every c i stored in the FIFO to be compared against every c j registered in the array. The refeed is repeated until a complete pass of the OF cubes through the array does not modify the cubes registered in the array. After this (Figure 6(b) ), if any c i in the OF is going to be part of the result, for example, x 4 , it must only be still compared to other elements in the OF. The contents of the array may be shifted out to a memory that stores the results, the array is reset and the OF cubes are refed (Figure 6(c) ). The three-step process continues until the OF is empty after an OF refeed. 
Further Implementation Considerations
Aside from the reduction tasks, the computational blocks in our implementation can be classified as performing one of two functionalities: element pair generation or multiple term multiplication. Element pair generators, such as the CNF Generator and the Binomial Generator in Figure 1 , input a set of tuples and output a distinct pair of tuples at each computational step. This is accomplished by using the architecture illustrated in Figure 7 . As tuples are input (each accompanied by a load signal) both RAMs store the tuples and an Address Control Generation unit counts the number of tuples. When a status signal indicates that all tuples have been received, the ACG begins generating all possible address combinations (a 0 , a 1 ) for 0 ≤ a 0 < n, a 0 + 1 ≤ a 1 ≤ n. The distinct pairs of n-tuples produced by the CNF Generator are fed to n log 2 p -comparators to compute the r(i, j) terms in Algorithm 1, as shown Figure 8 . In the case of the Binomial Generator, the corresponding terms within the distinct pairs of n-tuples are subtracted from each other and the results are used to determine the terms of the binomial for (3), as shown in Figure 9 .
Multiple-term multiplication/conjunction, such as needed in blocks CNF2DNF and Lagrange, is performed using the architectures shown in Figures 10 and 11 , respectively. The architecture depicted in Figure 10 receives DOLs from the CNF Generator, the DOL2Literals block generates a literal expression for each found in the DOL. The literals are queued in a circular FIFO and are conjuncted to each of the results in the Preliminary Results FIFO. After each DOL is conjuncted, the resulting cubes are passed through the Cover Cubes block which eliminates redundant cubes. A control unit monitors and generates signals to coordinate the data path activities.
The architecture depicted in Figure 11 performs the computations of (3). Each binomial that is received from the Binomial Generator is registered. Then each of its two monomials is multiplied by each of the monomials that has resulted from previous binomial multiplications. The monomials that result from the multiplication are input to the PrelimPolyAdd block which adds similar monomials to n-tuple j n-tuple i
. . . n-tuple j n-tuple i
Mux Mux
Priority encoder reduce the size of the preliminary product. PrelimPolyAdd outputs its result to the PrelimRes FIFO. The iterative process continues until the last binomial of a product is received, in which case each monomial of the preliminary product is output to the final Polynomial Addition block (see Figure 1 ). Once the last binomial of the last product has been multiplied, the (final) Polynomial Addition block is signaled and begins to output the monomials of the final result.
Results and Discussion
This section presents and discusses results for the individual reduction tasks as well as the complete interpolation algorithm implementation.
Reduction Tasks.
The systolic array cells for the subtasks of distinctness, polynomial addition, and Boolean cover were modeled using Verilog HDL, based on their function definitions presented in Section 5. The array and control units were also modeled using Verilog based on the descriptions provided in Section 6. Cell, systolic array and control unit models were designed as parameterizable modules in terms of width of elements, array depth and control mode, respectively. Models were behaviorally simulated using ModelSim SE 6.0 to validate their results against software versions of the algorithms. Xilinx ISE Design Suite 11.1 was used to synthesize the architectures for a Virtex-4 XC4VLX200ff1513-11 in the DRC Dev Systems DS2000. Default synthesis parameters were used, such as mapping FIFOs to block RAMs and speed as the optimization goal. All the reported results for the individual tasks are before Place and Route. For each of the three subtasks, several systolic array depths (d sa ) were implemented, then randomly generated sequences of various bit widths and lengths were input. Table 2 shows timing and resource results for (1) the cover operation on sequences of length 4096 and width of 32 bits, (2) the polynomial addition operation on sequences of length 4096, where each element represents a monomial of 8 variables in Z 5 , and (3) the distinctness operation on the same sequences as (3) . Results are compared against a software implementation compiled using the GNU project c++ compiler with optimization level 3 and running on a 3.40 GHz Intel Pentium D CPU with 2 MB cache and 3 GB RAM. CPU execution times were measured by accessing the processor's cycle counter. In all cases, we report the smallest measured CPU execution time. The freq column indicates the FPGA clock frequency, t F and t C are the FPGA and CPU run times, and speedup is computed as t C /t F . The numbers shown in parenthesis next to each resource quantity are the resource percentages for the target device.
Several important observations can be highlighted from the results in Table 4 . First, the maximum operation frequency is independent of the array depth, allowing performance to scale adequately with depth. In fact, given the simplicity of the required control unit and storage, the longest path was mostly determined by the implementation of functions W and S inside the basic cells. Second, the parallelism achieved through the use of the systolic array greatly compensates for the refeeds that must be performed when the result is longer than the array depth. For instance, the result for the cover operation was of length 3386 on average, yet there is a speedup of more than 40× for d sa = 256. Finally, the resource utilization to achieve competitive speedups is minimal and easy to estimate when scaling since it is almost completely attributable to the systolic array cells and overflow FIFO. Hence when using this structure as part of a bigger design, the designer could readily determine which parameters are best suited for the area/performance constraints and objectives.
Part of our purpose here is to argue that the reduction tasks can be implemented efficiently using the systolic array design. With Table 4 , we intend to demonstrate the effect of each particular operation for users that might want to use any of them as part of a bigger design. Clearly, the ultimate speedup may be dictated by other components of the final design, as shown by Tables 5 and 6.
Interpolation Algorithm.
We modeled the proposed interpolation algorithm in Verilog HDL using the reduction tasks and the architectures described in previous sections as building blocks. The behavioral simulation and synthesis tools as well as the software compilation parameters and platform were the same as in the reduction task experiments.
The results are shown separately for each of the two interpolation stages since, once the bases have been computed by the first stage, a scientist's intervention would be necessary to choose the most appropriate before proceeding to the second stage. For both stages, the synthesis tool determined maximum operating frequencies above 150 MHz, thus we chose 150 MHz for the experiments. The Xilinx ISE 11.1 place and route tools (with default effort level) were able to meet the timing requirements reported for Stages 1 and 2. We attribute this to the fact that the great majority of connections in our design are local and grow only linearly with n. hence smaller values of d SA produced better results since they incurred in less unnecessary overhead for filling and emptying. As the number of variables increases, the number of terms produced by Algorithm 1 increases exponentially, which is clearly evidenced by the run times, that is, columns t F and t C . The FPGA implementation achieves speedups of 13× to 67×, which mostly increase with n. The increase in speedup at the higher values of n can be attributed to a higher percentage of time of full pipeline utilization as well as the increased benefit of parallelism in the comparisons between the n-tuple variables, that is, the computation of the r(i, j) terms in Algorithm 1. As expected, the use of necessarily serial processing tasks such as the multiplication of the DOLs to obtain the DNF (see Example 2) limit the speedup achieved by the FPGA implementation when compared to the results obtained using the cover subtask by itself. Although the ultimate goal of reverse engineering might be understanding complete complex biological systems, recent reported models limit their exploration over specific subsystems or punctual mechanisms, for example, the study cell-cycle regulatory network of fission yeast and apoptosis (programmed cell death) [17, 18] , using a moderate number of genes between 8 and 20. Our results lead us to believe that the advantages of using the proposed architecture in reconfigurable logic will be sustained even at bigger sizes. As the sizes increase our architecture will have to rely increasingly on iterations rather than parallel computations. Nevertheless, due to the custom nature (finite field) of the data, even at these sizes the parallelism over each iteration will be enough to offer significant speedups over the alternative, fully serial implementation in general purpose processors.
Stage 1.
Speedup is maintained at a cost of increasing resource utilization. For the considered cases, BRAMs, which are used to implement the various FIFOs and RAMs of stage 1, is the fastest growing resource. This could pose a challenge to maintain performance at even higher n values. However, this issue can potentially be resolved by complementing the BRAM capacity with a dedicated external memory bank, as is commonly included in modern reconfigurable computer platforms [19, 20] . A study of the impact of external memory on the performance of the proposed architecture is future work. An advantage of our proposed architecture is that all the required memories behave as FIFOs. Thus, their accesses patterns are predictable and (as part of future work) we can use this information to device a buffering scheme that hides the latency of accessing the external memories.
The increase in user logic (Slices, LUTs) in higher n can be circumvented by using shallower systolic arrays (smaller d SA ), which still maintain competitive speedups. For example, although not shown in the table, for n = 13 speedups of 30× and 45× were obtained by using d SA values of 64 and 128.
Stage 2.
As explained in Section 2.2, our algorithm uses a base X i computed in Stage 1 to determine a set of equivalence classes D/≡ Xi from the original set of ntuples. The equivalence classes are then used to compute an interpolating polynomial using (3) . A greater number of variables in the chosen X i and a high variability among the ntuples translates to larger cardinalities of D/≡ Xi . The greater the cardinality of D/≡ Xi the higher the number of binomials computed by (3) , which constitutes the bulk of computation for Stage 2.
The timing results shown in Table 6 support the previous statements. The sets of 100 n-tuples from Stage 1 along with bases of 2 to 4 variables were input to Stage 2. Observe that since the data sets were randomly generated there was high variability in the sets of n-tuples, which resulted in similar D/≡ Xi cardinalities (column Class Reps) and run times (columns t F and t C ) for the various sizes of n (for a given number of variables in the chosen basis). The FPGA implementation achieves speedups of 16× to 25×, which mostly increase with n. The moderate increment in speedup as n increases is attributed to the increased benefit of parallelism for the generation of binomials, and the comparison and addition of monomials.
Resource utilization distribution in Stage 2 is roughly uniform between the user logic, that is, Slices, and BRAMs. Furthermore, the increment in resource utilization for increasing values of n and numbers of variables in the basis is quite moderate, for example, the percentage of slices goes from 8.69% for n = 8 to 12.87% for n = 13.
Conclusion
This paper presents a new methodology based on Lagrange interpolation with two important properties: (1) it identifies redundant variables and generates polynomials containing only nonredundant variables, and (2) it computes exclusively on a reduced data set. The analysis of the methodology for its hardware implementation led us to the identification of several reduction tasks which were generalized to a simple algorithm. The generalized algorithm can be efficiently mapped to a systolic array in which each processing cell implements a pair of binary operations between an incoming and a stored value. The tasks of Boolean cover, distinctness, and multivariate polynomial addition were implemented and served as building blocks to the rest of the application. The FPGA implementation of the reduction operations and the complete application achieved speedups of up to 172× and 67×, respectively, as compared to software implementations run on a contemporary CPU, with moderate resource utilization.
