We present an online arithmetic scheme for hardware evaluation of multinomials arising in Bayesian networks. The design approach consists of representing the multinomial in a factored form as an arithmetic circuit which is then partitioned into subgraphs and mapped to FPGA hardware as a network of online modules connected serially and operating in overlapped manner. This minimizes the interconnect demand without a drastic increase in computation latency. We developed a partitioning/mapping algorithm, designed basic radix-2 online operators and modules, and determined their cost/performance characteristics. We also evaluated the cost/performance characteristics of implementing a Bayesian network on an FPGA chip.
INTRODUCTION
The problem of evaluating multinomials (sums of multivariable products) arises in computing inferences in a Bayesian network (BN). A BN is a compact, graphical model of a probability distribution. 9 It consists of a directed acyclic graph with edges indicating direct causal influences among the nodes (variables) and a set of probability distribution tables quantifying these influences. BNs are widely used in problems where we need to answer questions based on uncertain information. Fig. 1 below shows the well-known ASIA 11 Bayesian network. ASIA depicts the casual structure of a patient having tuberculosis, lung cancer, or bronchitis based on several factors, one being whether or not the patient has recently been to Asia. Probabilistic inference, a common operation performed on BNs, involves finding the probability of an event based on the probability distribution defined by the BN. In the ASIA example, we may want to find the probability of the patient having lung cancer given that he has Dyspnoea.
Darwiche
5 has proposed a new approach for computing inference in BNs wherein he represents the BN using a multinomial and shows that inference can be answered by evaluating the multinomial. To reduce the number of operations, the multinomial is transformed into a factored form which corresponds to a network of multiplication and addition nodes called the arithmetic circuit 5 (AC) of the Bayesian network. While the multinomial is exponential in size, the corresponding AC is of manageable size. This is achieved by the use of sophisticated algorithms which exploit the global and local structure in the probability distribution.
In this paper, we discuss the design of a hardware arithmetic circuit for evaluating a BN multinomial whose AC representation is given. We consider using online arithmetic 7 to reduce the interconnection requirements of the arithmetic circuit consisting of many operators. A straightforward approach is to map the given AC to a network of online multipliers and adders. Another approach we explored is to use composite online operators, such linear system solvers 4 which, when applicable, have smaller online delay than the delay of online networks. In the next section we discuss obtaining AC forms suitable for implementation. We also discuss organization issues when AC computational requirements exceed available resources. In Section 3 we present the designs and delay/cost characteristics of the online operators: online adder, online multiplier, and linear systems operator (LSO). Section 4 presents estimates of the delay and cost of implementing the ASIA example on a Xilinx FPGA.
COMPUTING THE AC
The arithmetic circuit derived from the Bayesian network, as discussed in the last section, is a DAG (directed acyclic graph) in which nodes represent operators or values and edges show dependencies. The inputs of the AC are probabilities, θ, and indicators, λ. The θ values are taken from the probability distribution tables defined for each variable in the Bayesian network, and the λ values specify whether there is evidence that is or is not consistent with a variable.
For our purposes, the AC has a logical, as well as an arithmetic structure. The AC originally derived from the Bayesian network is captured by an NNF (Negation Normal Form) file, in which nodes are represented by a character literal (e.g., "A" for And ), followed by additional parameters that further specify the node. The NNF file currently contains three types of nodes: And, Or and Leaf nodes. And nodes are multiplicative nodes; when computing the probability of an event A And B, the probabilities are multiplied. The Or nodes are similar to And nodes; when computing the probability of an event A Or B, the probabilities are added. The Leaf nodes contain a probability and/or an indicator. The And and Or nodes are followed by the number of inputs the operator takes, and a list of indexes that specify the input. Every node has an implicit index according to its position in the file. The very first node in the file has an implicit index of zero. All children of a node will appear in the NNF file before its parent(s). Once the structure of the DAG is determined, we must differentiate between arithmetic and logical operators. Some And nodes will have as inputs, indicators as well as probabilities. Although multiplication and selection by an indicator are logically equivalent, for a hardware implementation we treat selection operations as logical operators and only perform arithmetic when necessary. To determine the nature of an And operator, we incorporate the LMAP file, which maps values to the leaves of the DAG. Leaves are specified in the NNF as an "L" followed by an index into the LMAP file. The LMAP file contains three types of leaves which are of interest to us: Indicators (I), Probabilities (P), and Hybrids (H) (which act as an Indicator And Probability, as shown in Fig. 3c ). By recursively examining the inputs to an operator, we can determine whether the operator is of a logic or arithmetic nature. Fig. 3a and Fig. 3b illustrate this concept. 
Partitioning and Mapping
Once an expression has been identified through the NNF and the nature of operations is distinguished with the LMAP file, we can compute inference for the given BN. The process in which the AC can be computed depends on the size of the AC and the hardware resource constraints available. If the AC is small enough to satisfy the hardware constraints, a direct mapping is performed to the operators described in section 3. If the AC demands greater resources than available, it must be computed in smaller iterations. The iterations to make over the AC must take into account dependencies, and sharing (to be discussed in section 2.1.1). Essentially, this involves a two stage process which firstly partitions the AC (a DAG) into subgraphs which are rooted trees. In order to compute the AC, all subgraphs must be computed in the order intrinsic to their dependencies present in the AC. Certain subgraphs will favor an online approach, and these subgraphs must be mapped to the available resources. If a subgraph demands a greater set of resources than available, it must be further pruned to create smaller trees which can be mapped to the resources. The prunings in the subgraph will be computed in the order specified by their dependencies. Once all the prunings of a subgraph have been mapped, the subgraph will be computed. This will enable further computation of other dependent subgraphs.
Partitioning
For a given node n in the AC, the predecessors (P ) of n, act as inputs to n. Also, for a given node n, S is the set of successor nodes which depend on n, then n is a shared node if |S| > 1. This is shown in Fig. 4 . Figure 4 . for a given node n, P is the set of predecessors, and S is the set of successors.
Shared nodes contain a value that can be shared anywhere in the network. This means that when computing shared nodes, the output must be saved, so if need be, the value can be used by other iterations. Algorithm 2.1 was used to eliminate shared nodes in the DAG, and create a set of subgraphs: rooted trees which contain no shared nodes. The output of a subgraph will be stored and used as inputs to other subgraphs.
Algorithm 2.1:
PartitionToSubgraphs(n, V, sg) is called with the root of the DAG as n, an empty set containing the visited nodes V , and a newly created root subgraph sg. Predecessor(n) returns a set containing all of node n's predecessors. This algorithm will create a new subgraph at every shared node which is an operator, and place that operator in the newly created subgraph. Then the algorithm recurses on the shared node's predecessors with the newly created subgraph.
Using this algorithm, the PIGS 12 and HAILFINDER 12 arithmetic circuits were partitioned. The charts in Fig. 5 show the distribution of the subgraphs which resulted from the partitioning. The horizontal axis shows the size of the subgraph as the number of operators it contains, and the vertical axis represents the frequency of the given subgraph multiplied by its size. This representation shows the total number of binary operations contributed by a subgraph of a given size. Mapping the AC to an online network requires careful partitioning to expose the characteristics which render online algorithms favorable. The delay through an online chain is characterized by the total online delay through the chain, ∆ = i (δ i + 1), plus the precision of the input, m, which comes about from the serial nature of the algorithm. Long online chains are favorable to amortize the delay contributed by the precision, and conceal the serial delay.
Intuitively, the algorithm is designed to create the largest possible subgraph. This partitioning results in a particular profile in the size of subgraphs created. At one side of the spectrum, are large numbers of small subgraphs, which have no depth, and hence no advantage for online algorithms. At the other side of the spectrum there are very large subgraphs which are deep and hence amortize the online delay. This profile indicates that both parallel and serial style arithmetic might be suitable for computing the AC. Large subgraphs would be partitioned by pruning the subgraph tree, and mapping it to the resources available. The different iterations required to compute large subgraphs will essentially represent a set of super-operators through which the arithmetic circuit is solved.
Mapping
In the mapping stage, we map a given partition to a set of resources available. At the moment, only the binary operators presented in Section 3 are available. Thus, every operator with a greater number of inputs is restructured as a tree of binary operators. The subgraph is still a rooted tree which makes mapping a somewhat trivial task.
Given a hardware resource constraint describing the availability of operators, we can prune parts of the tree which satisfy the resource constraint. These prunings will create the set of reconfigurations, S, required to fully compute the AC. Each reconfiguration is a super-operator for solving part of the much larger subgraph. If every super-operator, s i ∈ S, has a given computation time, C i , and each reconfiguration requires time t i , then the total time to compute the subgraph is equal to ∀si∈S C i + t i .
One must take into careful consideration that the subgraph still contains L-and nodes which are logical operations. These operations essentially either feed the value of their child tree to the next dependent node or feed a zero value. These logical operations must be fully exploited to avoid performing computations that become irrelevant due to a L-and node. This concept will be discussed further in the next section.
ASIA Example
The ASIA example introduced in Section 1 is used here as a working example to demonstrate how the set of super-operators for the AC are determined. Executing Algorithm 2.1 on the AC derived from compiling the Bayesian network shown in Fig. 1 will result in the subgraphs shown in Fig. 6 . As it can be seen, the shared nodes in the AC are the roots of the subgraphs created. From this point on we will only consider the largest subgraph shown, which is the only subgraph created in this example that will benefit from being computed in an online fashion. The subgraphs are subsequently checked for operators with more than two inputs, any such operator is expanded to a tree of binary operators. We now search for "logical and" (L-and) nodes in the graph by incorporating the LMAP description of the ASIA example. The nodes in the graph which are considered as L-and operators are marked as "A", shown for the subgraph in Fig. 6 .
In our example, the subgraph fits entirely on chip and does not require pruning of the subgraph or reconfiguration. In other words the ASIA example only requires one super-operator, which solves the entire subgraph. Of course for larger graphs this will not be the case and reconfiguration or efficient mapping to a constrained physical routing must be accomplished. At this point in time, we have established that all L-and nodes should be precomputed for a given subgraph before creating a set of super-operators. The resulting tree created through the precomputation of L-and nodes avoids any unnecessary computation. We have yet to examine the details of this approach to discover any overhead which this precomputation step may introduce.
ONLINE OPERATORS
This section describes the design and implementation of fixed-precision, radix-2 online arithmetic operators and the corresponding modules used in the arithmetic circuit derived from a Bayesian network. For each module, delay and cost as a function of precision are determined. Online addition and online multiplication are presented in Section 3.1 and Section 3.2, respectively. The LSO operators and modules are described in Section 3.3 and Section 3.4, respectively. For all modules, the radix is r = 2, m is the precision in bits, δ is the online delay, and t is the truncation precision used for residual estimation. The radix-2 signed-digits for all modules are encoded using the borrow-save notation, where x ∈ {−1, 0, 1} and x = x + − x − . For the OLM and LSO modules, the on-the-fly conversion/appending (CA) function is used to convert from the signed-digit representation to the two's-complement representation in a digit-serial manner. 7 The implementations presented are related to those of. 
Online Addition
The design and implementation of an online adder (OLA) with δ = 2 will now be discussed. Let z = x + y, where x k and y k are the digits of the inputs. Fig. 3.1a shows the implementation of the OLA.
The online adder in Fig. 3 .1a produces a result in the range [0,1) when |x| + |y| < 1, but with z = z 0 .z 1 z 2 . . .. In other words, the first output digit produced has a weight of 2 0 . The first digit of x and y consumed by all modules (LSOs, online multipliers, and online adders) are required to have a weight of 2 −1 . Thus, the online adder must be normalized so its output can be fed into the next module. It can be shown that if z 0 = 1, then the next non-zero digit must be -1. Likewise, if z 0 = −1, then the next non-zero digit must be 1. Using these facts, the normalization can be done using a simple state machine, as shown in Fig. 3 .1b.
The first digit, z 0 , produced by the online adder arrives at state S0. If this digit is 0, then no normalization is required and edge a is taken to state S2. However, if the first digit is non-zero, then edge b is taken to state 
S1
. A value of 0 is always outputted when the state machine is in state S0. The state machine remains in S1 (edge c) until a non-zero digit arrives, which then causes the state machine to go to state S2 via edge d. While in the S1 state, the value of the first digit is outputted. In state S2, normalization has completed and the received digit from the online adder is passed through to the output of the normalizer. Edge e is taken once the online adder has produced the m digits, and a new operation can begin.
From Fig. 3 .1a, the critical path through the online adder consists of two inverters and two full-adders. To avoid an extra cycle to perform the normalization, z j+1 is fed into the normalizer. The normalizer performs an exclusive-or of z + j+1 and z − j+1 to check for a non-zero digit. Thus, the critical path of the online adder and normalizer is equal to Eq. 1. T ff includes the clock-to-q delay from the flip-flops being read as well as the setup time for the flip-flops being written to. The implementation of the online adder with the normalizer in a Xilinx Virtex-4 FPGA requires 22 LUTs and 13 flip-flops, which fits in only 3 CLBs, and can be clocked at over 500 MHz. Note that the implementation remains the same for any m.
Online Multiplication
The design and implementation of an online multiplier (OLM) with δ = 3 and t = 2 will now be discussed. The five most-significant bits (MSBs) of the datapath, excluding the CA modules, are shown in Fig. 9a Fig. 9b shows the bit-slice for the repeated bits (RBs). For m bits of precision, there are (m − 3 − 1) instantiations of the RB slice. Fig. 9c shows the least-significant bit (LSB) of the datapath. Recall that multiplying an input by -1 in the two's-complement representation can be done simply by negating each bit of the input and adding a logical 1 to the unit in the last position (ulp). These "'carry-ins"' are accounted for in the LSB slice. 
T OLM = T ff + T 4-to-1 mux + T [4:2] adder + T 4-bit CRA + T SEL and SUB
The critical path of the online multiplier, T OLM , is equal to Eq. 5. T ff takes into account the clock-to-q delay from the flip-flops being read as well as the setup time for the flip-flops being written to. It is assumed that the SEL and SU B functions are performed in parallel and take the same amount of time. This assumption is valid for any 4-input, LUT-based FPGA. In a Xilinx Virtex-4 FPGA, frequencies of over 250 MHz have been achieved. Table 1 
Linear System Operators
Linear System Operators (LSOs) map an arithmetic expression into an equivalent Linear System of Equations (LS). Equivalence here means that when the LS is solved, one of the variables (typically the "first" variable) evaluates to the same or power-of-two scaled value of the original expression. If we accept that expressions can be represented as E-graphs, an LSO can also be defined as an operator that transforms an E-graph into a LS. Consider the expression and its AC shown in Fig. 10a . The LS produced by applying an LSO to this AC is shown in Fig. 10b . Solving the LS shows that y 0 is indeed equal to the value of the original expression.
Depending on the nature of the AC acted upon, LSOs can be classified into different types. The above LSO is an example of a Polynomial LSO or PLSO because it maps a combination of addition and multiplication operations, similar to a polynomial expression, into a LS. See reference 2 for more details on LSOs.
Mapping arithmetic expressions into LSs by using LSOs is done mainly to take advantage of existing, efficient online algorithms for solving LSs. The scheme proposed in reference 4 solves a LS more efficiently than the conventional scheme using a network of online adders and multipliers. However, one restriction in the use of the LSO approach is that inputs should be bounded. While unbounded inputs can be scaled to make them suitable for the LSO method, the extra cycles wasted to "descale" the final results can often neutralize the benefits of this approach. In general, for a linear system Ay = B to be solvable using this scheme, the following convergence requirements 4 should be satisfied:
In the context of Bayesian network ACs, LSOs can potentially help us achieve savings in cost/delay if applied carefully. The typical values taken by probability variables (between zero and one) are a bit on the higher side of the convergence upper bounds. Therefore, thoughtless application of LSOs to various portions of the E-graph may not gain anything and may even decrease the efficiency of the overall circuit. But, by careful analysis of the E-graph, regions could possibly be identified where the inputs can be proven to be bounded for all values of leaf input values. Such regions could exist, for example, after a cluster of multiplication nodes. Since probability values are always less than one, the output at each stage diminishes in value as computation propagates up the chain. These optimizations need further research.
LSO Modules
LSOs can be implemented using variations of the equation, y i = b i + a i y i+1 , where the subscript i indicates a row in the linear system. An n-row LSO results in (n − 1) modules with δ = 4. The n th module (i = n − 1) is of the form y n−1 = b n−1 and can be implemented using just a shift register or can be produced by a different module. The recurrence equation for the i th row is defined in Eq. 6. The square brackets denote the iteration index and the the second subscript denotes the digit index. 
The implementation of Eq. 6 will now be discussed. The two's-complement representation is used internally with a carry-save adder (CSA). As a result, the residual, w i , actually consists of two vectors, ws i and wc i . Thus, there are six terms in the recurrence equation. The sum of all the terms is multiplied by 2, which is implemented simply as a left-shift by 1 bit. Let the first two terms of the recurrence equation be defined as
th cycle, stored in a register, and is available in cycle j. This assumption will be validated later. The third term of the recurrence equation, 
(c) (a) Fig. 11a,ŵ i  4 [j] is computed. The reason is that in a Xilinx Virtex FPGA, there is no penalty for computing a 6-bit CRA rather than a 5-bit CRA. Now thatŵ i [j] is known, the SEL and SU B functions can be performed as described previously. Fig. 11c shows the least-significant bit (LSB) of the datapath. Recall that multiplying an input by -1 in the two's-complement representation can be done simply by negating each bit of the input and adding a logical 1 to the unit in the last position (ulp). These "'carry-ins"' are accounted for in the LSB slice.
The critical path of an LSO module, T LSO , is equal to Eq. 8. T ff takes into account the clock-to-q delay from the flip-flops being read as well as the setup time for the flip=flops being written to. It is assumed that the SEL and SU B functions are performed in parallel and take the same amount of time. In addition, it is assumed that the DBB block has the same delay as a [3:2] adder. These assumptions are valid for an implementation in any 4-input, LUT-based FPGA. The LSO module has a slightly more complex implementation than an online multiplier, resulting in similar delays and resource requirements. In a Xilinx Virtex-4 FPGA, frequencies of over 250 MHz have been achieved. Table 2 
SUMMARY
The AC for the ASIA Bayesian network (Fig. 6 ) has been implemented in a Xilinx Virtex-4 FPGA using a network of online operators from Section 3. For two-input operators, the original AC without any optimizations requires 55 multipliers and 20 adders. After incorporating the information from the LMAP file, the number of multipliers and adders has been reduced to 29 and 12, repectively. The main cause of the drastic reduction in operators is due to the fact that many And nodes turn out to be L-and nodes. The FPGA resource requirements for different values of m, the precision in bits, are given in the inference, a frequency of 200 MHz has been obtained for all values of m. As evident in Table 3 , the number of LUTs and flip-flops required is proportional to the precision.
The network of online adders and online multipliers takes T OL = 29 + m cycles to compute the inference. If LSOs were used in addition to OLAs and OLMs, the time to compute inference becomes T OL+LSO = 27 + m. The use of LSOs for this example produces very small savings in cycle time. However, the ASIA arithmetic circuit is so small that the main advantage of LSOs for BNs, the ability to span across multiple levels, is not fully utilized. It is expected that for larger BNs, the advantage of LSOs will be greater. Further research is required to quantify the benefit of LSOs.
The AC for the ASIA Bayesian network is small enough that it can be fully implemented in a single FPGA. For slightly larger ACs, multiple FPGAs can be used. The online scheme is really beneficial in this situation due to the fact that only two wires are needed for a module in one FPGA to communicate with a module in another FPGA. Therefore, the number of I/O pins on the FPGA is usually not the limiting factor in determining how much inter-FPGA module communication there can be, as would be the case if parallel arithmetic was used. For even larger ACs which cannot fit in the available hardware, a new approach must be taken as described in Section 2.2.
