An investigation of the implementation and optimization of Beneš Permutation Networks on Field Programmable Gate Arrays (FPGAs) is presented. Specialized design automation tools were used to achieve high performance and efficient area utilization. These tools were used to explore alternative placement and routing strategies, and to take advantage of the underlying FPGA resources. A significant improvement was obtained as compared to standard approaches based on Hardware Description Languages (HDLs) and general synthesis and place and route tools. In addition, several general improvements and extensions were discovered that further improve performance and reduce area.
I. Introduction
This paper discusses an FPGA implementation study of the Beneš Permutation Network (BPN). The BPN [BE65], originally developed for connecting devices in telephone switching, is a circuit of size $O(n \log n)$ and depth $O(\log n)$, built from $2 \times 2$ switches, which is capable of performing the "crossbar switch" function of routing any input in the network to any output. The BPN provides an asymptotic improvement in area over the straightforward network built with multiplexers, and the work presented here shows that an FPGA implementation uses less area for networks as small as size 4. The BPN is obtained from a recursive application of a three-step construction proposed by Clos [CL53], which builds a permutation network of size n = mk out of three stages of $m \times m$ and $k \times k$ networks. In addition to applications to telephone switching, the BPN can be used to route data in a parallel computer, and an FPGA implementation can be used for multi-processor FPGA designs [FE81].
The implementation presented in this paper uses a special-purpose tool to synthesize and place and route the circuit. The tool allows greater control of routing choices and better utilizes the underlying FPGA resources than is possible using the standard design flow based on a hardware description language and general synthesis and place and route tools. In addition, a tool is provided to configure the switch settings for any desired routing of the BPN. It is shown that the resulting implementation has significantly better area utilization and timing performance than was obtained using general-purpose tools. Moreover, for regular circuits such as the BPN, the construction of a tool to generate, synthesize, place, and route the circuit is fairly simple, and it should be possible to substantially generalize the tool so that it is capable of implementing a larger family of circuits. Section II reviews BPNs and presents some improvements and enhancements discovered during this study. Section III summarizes the implementation approaches that were investigated and presents the design automation tools that were developed, and Section IV presents empirical performance data and comparisons.
II. The Beneš Permutation Network
A Beneš Permutation Network (BPN) is a recursively defined circuit that permutes n signals [BE65]. The BPN is a non-blocking permutation network of size $O(n \log n)$, which is capable of performing the "crossbar switch" function of routing any input in the network to any output. The network consists of two-signal externally-controlled switches that can route the signals either through or cross-switch them. The network has $(n/2)(2\log_2 n - 1)$ switches: $2\log_2 n - 1$ columns of $n/2$ rows of switches. An 8-signal example is shown in Figure 1. The network consists of two recursively defined 4-signal networks, which are indicated with dashed lines. The initial column of switches maps one output to each recursive network and the final column of switches receives one input from each recursive network. The network connections are symmetrical about the center column of switches and use perfect shuffle and inverse perfect shuffle mappings. The perfect shuffle mapping corresponds to bit rotations of the binary addresses of the wires. Right bit rotation is used on the left half of the network, and left rotation is used on the right half. Given a permutation, the setting for each switch in the network can be easily determined by a simple recursive algorithm. Input switches are set so that the two signals that must arrive in any output switch are mapped to different subnetworks (two signals that meet at an exit switch are called "mates"). The following algorithm always guarantees this property.
1. Choose an arbitrary entry switch and set it straight (our software always chooses the top entry switch as the first current switch);
2. Repeat the following actions:
   a. For the current entry switch, find the mate of the signal connected to the bottom subnetwork and wire it (the mate) to the top subnetwork;
   b. Choose the entry switch to which the mate is connected as the new current entry switch;
   c. If the new current entry switch has already been set, terminate the loop;
3. If not every entry switch has been set, repeat the process, ignoring the entry switches that have already been set.
The subnetworks' switches are set recursively.
Pedersen, Ruslanov & Johnson
An example of the switch setting algorithm is shown in Figure 2. The top entry switch is set through (always). The signal directed to the bottom subnetwork is 5. Its mate, which can be determined from the exit switches, is signal 4. Thus, signal 4 is sent to the top subnetwork and the second entry switch is set crossed. The next signal that was directed to the bottom network is 2, whose mate is 3, which has already been mapped to the top subnetwork. The next remaining switch (which is switch 3) is set through, with signal 7 directed to the bottom subnetwork. Its mate is signal 6, which implies that the last switch is set crossed. Since the switch-setting algorithm always sets the first entry switch straight, that switch is redundant and can be eliminated. Because one entry switch can be removed from every recursively constructed subnetwork of size four or more, this optimization reduces the number of switches used by the Beneš Permutation Network by $n/2 - 1$, with the total number of switches needed being $n \log_2 n - n + 1$.
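The switch-setting procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' software. It assumes a power-of-two number of inputs, models the network by its recursive structure (entry column, two subnetworks, exit column) rather than by the shuffle wiring, and uses hypothetical names (`benes_settings`, `simulate`). A setting of `True` means "crossed", and `perm[i]` is the output to which input `i` must be routed.

```python
def benes_settings(perm):
    """Return nested switch settings that realize perm (perm[i] = output of input i)."""
    n = len(perm)
    if n == 2:
        return {"switch": perm[0] == 1}          # single switch; True means crossed
    inv = [0] * n                                # inv[o] = input that must reach output o
    for i, o in enumerate(perm):
        inv[o] = i
    entry = [None] * (n // 2)                    # None = entry switch not yet set
    for start in range(n // 2):                  # step 3: restart on any unset switch
        if entry[start] is not None:
            continue
        cur = start
        entry[cur] = False                       # step 1: set the starting switch straight
        while True:
            # Signal that the current switch sends to the bottom subnetwork.
            bottom_sig = 2 * cur + (0 if entry[cur] else 1)
            # Its mate shares an exit switch, so the output indices differ in bit 0.
            mate = inv[perm[bottom_sig] ^ 1]
            nxt = mate // 2
            if entry[nxt] is not None:           # step 2c: loop has closed
                break
            entry[nxt] = (mate % 2 == 1)         # step 2a: send the mate to the top
            cur = nxt                            # step 2b
    top_in = [2 * k + (1 if entry[k] else 0) for k in range(n // 2)]
    bot_in = [2 * k + (0 if entry[k] else 1) for k in range(n // 2)]
    exit_sw = [None] * (n // 2)
    for i in top_in:                             # top signal decides each exit switch
        exit_sw[perm[i] // 2] = (perm[i] % 2 == 1)
    return {
        "entry": entry,
        "exit": exit_sw,
        "top": benes_settings([perm[i] // 2 for i in top_in]),
        "bottom": benes_settings([perm[i] // 2 for i in bot_in]),
    }


def simulate(settings, signals):
    """Push a list of signals through the configured network and return the outputs."""
    if "switch" in settings:
        return signals[::-1] if settings["switch"] else list(signals)
    half = len(signals) // 2
    top, bottom = [], []
    for k in range(half):
        a, b = signals[2 * k], signals[2 * k + 1]
        if settings["entry"][k]:
            a, b = b, a
        top.append(a)
        bottom.append(b)
    top = simulate(settings["top"], top)
    bottom = simulate(settings["bottom"], bottom)
    out = []
    for j in range(half):
        a, b = top[j], bottom[j]
        if settings["exit"][j]:
            a, b = b, a
        out += [a, b]
    return out


perm = [2, 5, 1, 7, 0, 3, 6, 4]
routed = simulate(benes_settings(perm), list(range(len(perm))))
assert all(routed[perm[i]] == i for i in range(len(perm)))
```

In this sketch the first entry switch of every subnetwork always comes out straight, which is the property that allows that switch to be eliminated.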
In the literature the BPN always assumes that the number of inputs is a power of two, so that at each recursive stage the number of inputs can be partitioned into two equal subsets. The following construction generalizes a BPN to an arbitrary number of inputs. This obviates the need for padding the inputs and can save substantial space. If n is even, we can recursively construct the two subnetworks of size $\lfloor n/2 \rfloor$ and $\lceil n/2 \rceil$. If n is odd, the last entry switch and the last exit switch can be removed and replaced by a single wire as shown in Figure 3. Table 1 shows the number of switches required to implement a BPN of arbitrary size n, where n = 2 through n = 16.
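The switch count for the generalized construction can be sketched as a recurrence. The function below is an illustration under stated assumptions: the name `switch_count` is hypothetical, a network of size 2 is taken to be a single switch, a network of size 1 is a bare wire, and the redundant-first-switch optimization is not applied. It is not claimed to reproduce the entries of Table 1.

```python
def switch_count(n):
    """Switches in a BPN on n inputs, without removing redundant first switches."""
    if n <= 1:
        return 0                      # a single signal needs only a wire
    if n == 2:
        return 1                      # one 2x2 switch
    pairs = n // 2                    # entry switches; the same number of exit switches
    # For odd n the unpaired signal bypasses both columns on a single wire,
    # so only floor(n/2) entry and exit switches are needed.
    return 2 * pairs + switch_count(n // 2) + switch_count(n - n // 2)


# For powers of two the recurrence agrees with (n/2)(2*log2(n) - 1).
for k in range(1, 6):
    n = 1 << k
    assert switch_count(n) == (n // 2) * (2 * k - 1)
```

Under these assumptions a 5-input network needs 8 switches, compared with 20 if the inputs were padded to 8.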
III. FPGA Implementation of the BPN
This section reports on several implementations of the BPN using the APEX 20K series of FPGAs manufactured by the Altera Corporation. In order to understand the performance study in the following section it is necessary to review the organization of these devices (see [WAC] for more details). A hierarchy of resources is used in the APEX FPGAs. At the lowest level a LUT and a flip-flop are paired to form a Logic Element (LE). The APEX series provides between 1,200 and 51,840 LEs per FPGA. At the next level, ten LEs are grouped into a Logic Array Block (LAB). At the highest level in the hierarchy, multiple LABs are grouped into a MegaLAB, with APEX MegaLABs containing either sixteen or twenty-four LABs each, depending on the FPGA type. Each APEX FPGA contains multiple MegaLABs. In addition to the logic resources, each level in the hierarchy has its own dedicated interconnection network.
Two different approaches were used to implement BPNs. The first uses the standard design methodology, starting with a description of the circuit in the hardware description language VHDL [WAC]. The second approach uses an FPGA design tool made specifically for the task of implementing BPNs. This approach, while specialized to BPNs, provides greater control over the use of the underlying FPGA resources and, as will be shown in the next section, leads to better performance and utilization results.
It is straightforward to code the BPN in VHDL using a for-generate statement to instantiate switch components. The switches are connected, as described in the previous section, using a function to rotate the bits of the switch label. The VHDL code was then synthesized and compiled as shown in Figure 4. In this methodology, the HDL synthesizer converts an HDL representation of the design into an Electronic Design Interchange Format (EDIF) representation. The FPGA compiler then converts the EDIF representation into a bit stream that is used to implement the circuit in the FPGA. The EDIF representation that the synthesizer generates consists of a list of the FPGA resources to be used to implement the circuit, as well as the interconnections between these resources. The EDIF representation does not contain any physical assignment of FPGA resources (which are known as "placement directives"), and it presents the design in a single hierarchical level regardless of the complexity of the HDL design.
This methodology has been widely adopted because of its ability to efficiently create satisfactory designs for a wide range of applications. However, for very high performance designs this methodology is usually insufficient, and it must be augmented by the application of directives to the compiler in order to obtain satisfactory performance. This is typically done independently from the HDL synthesis, and it commonly takes the form of manual and/or semiautomatic insertion of constraints into the compiler. These constraints typically attempt to improve the timing performance of the design. They are needed because the EDIF representation of the design does not convey the timing relationships that are necessary to achieve the desired performance. In the absence of constraints, the compiler is free to apply its general-purpose placement and routing algorithms, with results that may not be optimum in all situations.
Variability in FPGA circuit timing performance is primarily due to differences in interconnect timing between circuit elements. FPGAs contain a hierarchy of routing resources, each of which has different timing delays. Connections between closely spaced FPGA resources have delay times of as little as hundreds of picoseconds. Interconnects between nearby resources have delays on the order of nanoseconds, and for distantly spaced resources the delays are on the order of tens of nanoseconds. Timing performance variability arises from the various ways in which the FPGA compiler can assign (or "place") the FPGA resources. When it places these resources in close physical proximity on the FPGA, there are lower interconnect delays in the circuit, and higher performance is obtained. When it places these resources more distant from each other, higher delays result, and poorer performance is obtained. The uniformity of timing performance is also affected by this methodology. This effect can be seen in circuits like the BPN, which contains many identical parallel channels. Unless a uniform placement strategy is used for all of the channels, there will be a wide variation in the circuit's timing performance from one channel to another.
The second implementation approach uses an FPGA design tool that replaces the HDL synthesizer. The methodology used with our tool is shown in Figure 5 . The significant difference in this methodology is that the tool simultaneously generates FPGA resource placement constraints along with the EDIF representation of the design. In this manner, circuit timing performance is directly controlled by the tool, which provides a mechanism for automatically placing those resources that require short interconnect delays close to each other. HDLs are used to abstract the details of the FPGA from the designer in an effort to enhance the general efficiency of the design effort. These otherwise abstracted details can sometimes be exploited to optimize a circuit in ways unforeseen by the synthesis designers, and which therefore might not be readily available to the HDL designer. We present two examples of BPN optimization techniques that use non-standard applications of FPGA resources. In one of these optimizations, the Logic Element (LE) resources used to build the BPN are constructed out of sub-LE pieces normally used to build the carry logic used in adder circuits. This technique provides a nearly 50% reduction in the FPGA resources used by the circuit. In the other optimization, FPGA resources typically used for cascading logic functions (where FPGA resources are combined to produce complicated logic functions) are employed to reduce the number of stages in the BPN by nearly half. This has the effect of improving circuit timing performance by almost a factor of two.
Our tool was designed to generate files that would be accepted by Altera's Quartus II FPGA compiler. The tool consists of three parts: 1) the user interface, 2) the EDIF generator, and 3) the Compiler Settings File (CSF) generator. The user interface is the method for indicating the high level design concepts to the tool, and in the case of the BPN it was used to indicate the size of the desired network. The EDIF generator creates a file that contains declarations for all of the FPGA resources and the connections between the resources. The CSF generator creates a file that indicates physical locations in the FPGA for all of the resources used in the circuit. The Quartus compiler is forced to use the placements indicated in the CSF file. The compilation process will fail if the compiler is unable to route the design for this placement.
An EDIF file consists of four main parts: 1) the header, 2) the port connections for the circuit, 3) the logic cell instantiations, and 4) the connectivity instantiations. The header contains descriptions of the FPGA resources for the family of device being used. It typically varies little from design to design. The ports are the "external" electrical connections to the circuit, which can either be external FPGA connections ("pins"), or connections to other FPGA circuits linked together by the compiler. The generator calculates the number and type of FPGA resources needed to implement the circuit, and instantiates them. The generator also calculates the connectivity between the resources, and instantiates all of the circuit nets necessary to implement the connections. All of the logic resources allocated in the EDIF file are assigned to physical FPGA resources in the Compiler Settings File (CSF) by the CSF generator. The CSF file has three parts. The first part is a header that contains directives for the FPGA compiler's report generation and default device options. The second part is of most interest here; it contains the FPGA resource placement directives. The third part of the CSF file contains other (non-placement) settings for the compiler.
The detailed circuit for one switch as implemented in an Altera APEX FPGA is shown in the accompanying figure. The circuit uses two FPGA Logic Elements (LEs, sometimes also referred to as "lcells"). Two 4-input Look Up Tables (LUTs), one from each LE, are used to construct the switch. The switch data inputs are terminals dataa and datab, the data outputs are combout_0 and combout_1, and the state of the switch is controlled by datac. When datac is low the dataa and datab inputs are passed straight through to outputs combout_0 and combout_1, respectively. When datac is high the dataa and datab inputs are switched to outputs combout_1 and combout_0, respectively.
The generator calculates the number of LEs for the network as twice the number of switches (since there are two LUTs per switch) and then declares all of the necessary LEs. The generator then calculates connectivity using the bit rotation algorithm, and instantiates all of the necessary circuit nets to implement the connections. Figure 7 shows a schematic of a 4-input network circuit, along with the reference designations assigned by the generator.
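The behavior of one switch and the two calculations performed by the generator can be sketched as follows. This is an illustrative model, not the generator's code. The function names are hypothetical, `le_count` assumes a power-of-two network with the full $(n/2)(2\log_2 n - 1)$ switches (no redundant switches removed), and the rotation helpers assume wire addresses that are `bits` wide.

```python
def switch_outputs(dataa, datab, datac):
    """Return (combout_0, combout_1) for one switch; datac high means crossed."""
    if datac:
        return datab, dataa
    return dataa, datab


def rotate_right(addr, bits):
    """Rotate a bits-wide wire address right by one position."""
    return (addr >> 1) | ((addr & 1) << (bits - 1))


def rotate_left(addr, bits):
    """Rotate a bits-wide wire address left by one position (inverse of rotate_right)."""
    mask = (1 << bits) - 1
    return ((addr << 1) & mask) | (addr >> (bits - 1))


def le_count(n):
    """LEs for a power-of-two BPN at two LEs per switch (assumes no switch removal)."""
    stages = 2 * (n.bit_length() - 1) - 1     # 2*log2(n) - 1 columns
    switches = (n // 2) * stages
    return 2 * switches


# Right rotation sends the two outputs of each entry switch to different halves:
# in an 8-wire network, wire 2 goes to the top half and wire 3 to the bottom half.
assert rotate_right(2, 3) < 4 <= rotate_right(3, 3)
```

With these assumptions a 4-input network uses 12 LEs and a 16-input network uses 112.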
IV. Timing and Area Results
This section provides timing and area results for different implementations of the BPN. The results provided show that improved performance, compared to the standard HDL-based methodology, is obtained using the specialized tool with routing strategies geared towards the BPN. Additional performance and area improvements are obtained from the tool by utilizing device-specific enhancements based on carry and cascade logic. Finally, data is provided that shows the BPN outperforms a multiplexer-based implementation for networks as small as size 4.
BPN Circuits Generated Using HDL
The code for the VHDL entity, along with the function definition and other assorted constructs related to this design, was synthesized using Leonardo Spectrum. The synthesis results for various sizes of n are shown in Table 2. Of particular note is the fact that the number of logic cells precisely matches the predicted results, with two LEs per switch, indicating that the synthesizer was unable to optimize this design beyond the theoretical limits predicted by the Beneš algorithm.
Table 2: Synthesis Results

BPN Circuits Generated With the Enhanced Tool Set
Four different FPGA placement methods of a 16-input BPN were used to study the relationship between timing performance and placement. Each of these methods used "rules" that generated placements which emphasized one type of interconnect routing strategy over another.
The LEs were assigned in the following manner for Method 1 (shown in Figure 8). Assignments started at the upper left corner of a MegaLAB with the LEs being assigned using the same relative address order as the switch position identifiers. As a LAB was filled the next LEs were assigned to the LAB directly to the right in the same MegaLAB. This placement represented a dense population of the BPN circuit within a single MegaLAB. Figure 8 is color-coded to indicate the type of hierarchical interconnect used to connect the LE's output to its destination. The color code and the delay for each type of interconnect is shown in Figure 9. The Method 1 interconnect is characterized by a preponderance of MegaLAB Fast Track interconnects, which have nominal propagation delays of one nanosecond, and also by Local interconnects, which have nominal delays of 300 picoseconds. Note that since the color code indicates the interconnect routing method used to connect the LE's output, the sixteen LEs used in the last stage of the network typically have interconnects to remote FPGA resources. This can be seen by their assignments to Row Fast Track, Row & Column Fast Track, and Column Fast Track interconnects. The LE interconnect types for the last stage of the network are shown to indicate the use of these LEs in the network; however, the timing delays of these interconnects were not included in the timing analysis.
Method 2 (shown in Figure 10) used a placement rule that also restricted the BPN to a single MegaLAB, but in this case the LE packing was not dense. Rather, each stage of the network was segregated into separate groups of LABs. While this had the effect of confining the interconnects within the MegaLAB, thereby achieving relatively high performance, the number of Local interconnects was greatly reduced.
Seven columns of LABs were assigned to the circuits in Method 3 (shown in Figure 11) and Method 4 (shown in Figure 12). The difference between Methods 3 and 4 is where the network was split into different MegaLABs, with Method 3 being split along the principal axis of symmetry in the network, thereby minimizing the number of Column Fast Track interconnects running from one MegaLAB to another.
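The dense placement rule of Method 1 can be sketched as a simple index calculation. This is only an illustration of the rule as described. The function name and the `(lab_column, le_index)` coordinates are hypothetical and do not reflect CSF syntax. The defaults assume ten LEs per LAB and a sixteen-LAB MegaLAB, and the LE ordering is assumed to follow the switch position identifiers.

```python
def dense_placement(num_les, les_per_lab=10, labs_per_megalab=16):
    """Assign LEs in order, filling each LAB before moving to the LAB to its right.

    Returns a list of (lab_column, le_index) pairs, one per LE, all inside a
    single MegaLAB. Raises ValueError if the circuit does not fit.
    """
    capacity = les_per_lab * labs_per_megalab
    if num_les > capacity:
        raise ValueError("circuit does not fit in one MegaLAB")
    return [(i // les_per_lab, i % les_per_lab) for i in range(num_les)]


# A 16-input BPN at two LEs per switch (112 LEs, assuming no switch removal)
# occupies the first twelve LABs of the MegaLAB under this rule.
placement = dense_placement(112)
labs_used = {lab for lab, _ in placement}
```

A sparse rule such as Method 2 would instead start each network stage in a fresh group of LABs, trading Local interconnects for a more regular layout.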
Timing Analysis
The timing of the circuits for the four different placements, shown in Figure 8 through Figure 12, was studied and compared to the circuit generated using the HDL-based methodology. Timing data was generated by applying the CSF files for each placement method to the Altera Quartus II FPGA compiler tool (Version 1.1, Build 155 07/18/2001 SJ). Note that the same EDIF file was used for all four placement methods. The Quartus timing simulator was then used to obtain timing data for each placement method.
The timing results were obtained from the Quartus Compilation Report, specifically from the "tpd" (pin-to-pin delays) Timing Analysis. In the BPN, as in most logic circuits, there are multiple circuit paths from a given input to a given output. The longest reported delay represents the worst-case timing through the FPGA for the longest logic path through the circuit (the "worst, worst-case" timing figure). Shortest path delays are also reported by Quartus; however, only the longest delay values are analyzed here, since these are of most interest in the context of a circuit that must retain flexible resource utilization, which the BPN must do to be a non-blocking network.
The APEX input receivers and output drivers are not located in the MegaLABs, so there are appreciable and varying interconnect delays between the input receivers and the first LEs in the BPN circuit, as well as between the BPN circuit and the output drivers. The timing figures were normalized to remove these input and output delays.
Variability in FPGA timing performance is primarily due to differences in the types of connection paths used in the circuit. The shortest delays are for the Local fan-out connections, which are used to connect LEs in adjacent LABs in a MegaLAB. Next fastest are the MegaLAB Fast Track fan-out connections, which are used to connect LEs in non-adjacent columns in a MegaLAB. The longest interconnect delays in the BPN circuit are the Column Fast Track fan-out connections, which are used to connect LEs in different MegaLABs.
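The three connection classes described above can be expressed as a small classifier over placements. This is a simplified sketch, not a model of the Quartus router. The function name and the `(megalab, lab_column)` coordinates are hypothetical, and a connection within the same LAB is assumed to count as Local.

```python
def interconnect_type(src, dst):
    """Classify a connection between two LEs placed at (megalab, lab_column)."""
    src_megalab, src_lab = src
    dst_megalab, dst_lab = dst
    if src_megalab != dst_megalab:
        return "Column Fast Track"        # slowest: crosses MegaLAB boundaries
    if abs(src_lab - dst_lab) <= 1:
        return "Local"                    # fastest: same or adjacent LAB (assumed)
    return "MegaLAB Fast Track"           # non-adjacent columns in one MegaLAB


# Example: classify the fan-out of an LE in LAB column 2 of MegaLAB 0.
kinds = [interconnect_type((0, 2), dst) for dst in [(0, 3), (0, 7), (1, 2)]]
```

Counting the classes over all nets of a candidate placement gives a quick, rough indication of how a placement rule trades Local connections against Fast Track connections.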
Timing Results
The summary analysis of the timing data for the four different placement rules is shown in Table 3 for the longest delay paths, with timing data for the VHDL-generated circuit shown for comparison purposes. Timing data for the four placement methods used by the tool set indicate that all of these circuits have better timing performance than the VHDL-generated circuit. The statistics for the tool set circuits indicate that there is a very tight distribution of the propagation delays, as evidenced by the standard deviations being, in general, about 5% of the average timing value. The circuits which used the first and third placement rules have the best performance (although, interestingly, the circuit which used the fourth placement rule has the best performance in terms of minimum delay). The average performance of the Method 4 circuit is nearly as good as that of the Method 1 circuit. However, the Method 2 circuit is clearly a poor choice. It has the worst timing results for both average and minimum delays, and it is also inefficient in terms of area allocation in the FPGA.
Optimized Implementations of the BPN
Improved area utilization and timing performance are obtained when built-in carry logic and cascade logic are used. The use of the logic cell's carry logic can reduce both the number of logic cells and the number of general circuit net connections needed to implement the BPN by half. The logic cell carry in and carry out connections, which are normally used for the arithmetic operations addition and subtraction, are used in this technique to route switch element connections instead. When carry logic is used the four-input, one-output LUT is essentially configured as two three-input, one-output LUTs. One of the inputs to both LUTs is the logic cell's carry input, with the other inputs being the logic cell's dataa and datab inputs. One of the LUTs drives the logic cell's combout output, and the other drives the logic cell's carry out output.
The use of the logic cell's cascade logic can reduce the number of "stages" (columns) needed to implement the BPN by nearly half, thereby improving the timing performance of the network. The logic cell cascade in and cascade out connections, which are used to "expand" the number of logic operands for functions of more than four variables, are used in this technique to "collapse" two switch stages into one. This technique also reduces the amount of logic resources used to implement the network. With this technique, only an even number of switch stages can be collapsed. The BPN always has an odd number of switch stages, which is why the number of "stages" used by this technique is always slightly more than half that of a conventional BPN implementation. This technique cannot be used in combination with the previously-described carry logic enhancement. Table 4 shows the number of Logic Elements (LEs) required for various techniques of implementing the BPN for various sizes of n. The number of LEs required to implement a permutation network using multiplexers is also shown for comparison purposes, since this is perhaps the most "natural" way for a designer to implement a permutation network in an FPGA.
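As a rough check on the "slightly more than half" figure for the cascade technique, the stage count can be worked out for a power-of-two n. The calculation assumes that stages are collapsed in adjacent pairs and that the one leftover stage is implemented conventionally. The text does not spell out the pairing, so this is an illustration rather than a statement of the tool's behavior, and the symbols $S$ and $S_{\text{cascade}}$ are introduced here only for this purpose.

```latex
S = 2\log_2 n - 1 \quad \text{(odd)}, \qquad
S_{\text{cascade}} = \frac{S - 1}{2} + 1 = \log_2 n
```

For n = 16 this gives S = 7 conventional stages and 4 stages after collapsing, which is just over half of 7.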
