Typical Field Programmable Gate Arrays (FPGAs) are generally used in signal processing, image processing and rapid prototyping applications. The integration of Silicon Germanium (SiGe) Heterojunction Bipolar Transistor (HBT) devices with CMOS allows a new family of FPGAs to be created. This paper elaborates new ideas in designing high-speed SiGe BiCMOS FPGAs based on the Xilinx 6200. The paper explains new methods to reduce circuitry and utilize a novel power management scheme to reduce power consumption. In addition, new decoding logic has been developed where the address and data lines are shared. These ideas have improved the performance of a SiGe FPGA to run in the 5∼6 GHz range.
Introduction
Field Programmable Gate Arrays (FPGAs) have gained popularity due to their flexibility and wide range of applications. An FPGA consists of multiple copies of a basic programmable logic element or cell. Logic cells are arranged in a column or matrix on the chip. To perform more complex operations, logic cells can be automatically connected to other logic elements on the chip using a programmable interconnection network. The operating speeds of current CMOS FPGAs are around 70-250 MHz. These slow operating speeds prevent their use in high-speed digital system applications.
High-speed FPGAs find applications in many research and commercial fields, such as digital signal processing, where digital filters need fast multipliers, adders, subtractors, flip-flops and similar blocks [1]. They can also be used in high-speed broadband networks [2], high-speed inline processing, image recognition and genome analysis [3]. The top-level architecture of a Silicon Germanium (SiGe) FPGA is shown in Fig. 1(a). The block diagram of a single logic cell is shown in Fig. 1(b). It consists of a Configurable Logic Block (CLB) and routing multiplexers [4]. This paper describes the design of a SiGe FPGA with new features that is compatible with the Xilinx 6200 architecture. Changes have been made to make it work optimally in the design environment. Despite the high operating frequencies, switching noise is low because Current Mode Logic (CML) has been selected for the logic cell design. CML is very similar to Emitter Coupled Logic (ECL) [5]; the only difference is that differential pairs are used for all signals, so no reference voltage is needed. The SiGe 5HP Heterojunction Bipolar Transistors (HBTs) are high-speed transistors with cutoff frequencies around 50 GHz [6]. All simulations were done using the IBM 5HP design kit and Cadence 4.4.6.
The SiGe HBT structure and its advantages
Si is widely used in radio frequency (RF) and microwave circuit applications because of advantages such as a high-quality dielectric, excellent thermal properties, extreme abundance and easy purification. However, Si is not ideal from a device designer's point of view because the carrier mobility is rather small for both electrons and holes, and the maximum velocity that these carriers can attain is limited to about 1 × 10^7 cm/s under normal conditions. Hence, Si is regarded as a slow semiconductor [7].
The SiGe HBT is one of the most successful bandgap engineering devices. Its performance is comparable to that of GaAs RF devices, while it can be fabricated at a significantly lower cost. In order to achieve higher performance, Ge is selectively introduced into the base region of the transistor. The Ge mole fraction in typical profiles varies from 3∼9%. From Fig. 2, it can be seen that there exists a drift field in the base, which aids the faster movement of minority carriers. This reduces the base transit time and hence increases the cutoff frequency. The smaller base bandgap of SiGe compared to Si enhances electron injection, producing higher current gain for the same base doping level compared to Si devices. SiGe HBTs and Si CMOS devices can be fabricated on the same substrate because the process is strictly compatible with existing CMOS tool sets and metallization schemes. This technology is referred to as BiCMOS technology. Due to all these advantages, the SiGe 5HP technology (provided by IBM) was chosen for the FPGA design.
Fig. 3 shows a simple CML Exclusive-OR (XOR) gate with input and output waveforms [8]. The rise and fall times of the XOR gate are approximately 17 ps and 13.6 ps respectively. The current is held constant at 0.6 mA by a current mirror at the bottom of the tree. There can be up to 4 transistors in every branch of the tree, which leads to 4 levels of logic in a single tree: level 1 (0 to -0.25 V), level 2 (-0.95 to -1.2 V), level 3 (-1.9 to -2.15 V) and level 4 (-2.85 to -3.1 V). The number of levels is determined by the power supply (0 and -4.5 V), and the difference between levels is slightly more than one V_BE (0.85 V). A peak-to-peak voltage swing of 250 mV was found to be appropriate for the design. A level 1 output can be converted to the other signal levels by using an emitter follower. Current mostly flows through one of the branches of the tree structure. This branch pulls either OUT or the complementary output low, depending on which path is conducting according to the logic. For the XOR gate, if A=1 and B=0, the complementary output has to be pulled low, so the branch that carries the current when A=1 and B=0 should be connected to the complementary output. The resistor at the top of the tree structure, in conjunction with the current source, determines the swing.
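The level arithmetic described above can be reproduced with a short sketch. It uses only the numbers quoted in the text (0.6 mA tree current, 250 mV swing, 0.95 V spacing read off the quoted level ranges, -4.5 V supply). The relation swing = current × load resistor is the usual first-order CML assumption and is not stated explicitly in the paper, so the resulting resistor value is illustrative only.

```python
# Values quoted in the text for the SiGe 5HP CML trees.
SUPPLY_V = -4.5          # negative supply rail (V)
SWING_V = 0.25           # peak-to-peak swing (V)
TREE_CURRENT_A = 0.6e-3  # tree current set by the current mirror (A)
LEVEL_STEP_V = 0.95      # spacing between adjacent levels, read off the quoted ranges (V)


def level_range(level: int) -> tuple[float, float]:
    """Return (high, low) voltages for logic level 1..4."""
    high = (1 - level) * LEVEL_STEP_V
    return round(high, 2), round(high - SWING_V, 2)


def load_resistance_ohm(swing_v: float = SWING_V,
                        current_a: float = TREE_CURRENT_A) -> float:
    """Assumption: swing = tree current x load resistor (current fully steered)."""
    return swing_v / current_a


levels = {n: level_range(n) for n in range(1, 5)}
# Voltage remaining between the low rail of level 4 and the supply.
headroom_v = levels[4][1] - SUPPLY_V

for n, (high, low) in levels.items():
    print(f"level{n}: {high:+.2f} V to {low:+.2f} V")
print(f"load resistor ~ {load_resistance_ohm():.0f} ohm, "
      f"headroom below level4 ~ {headroom_v:.2f} V")
```

Under these assumptions the load resistor comes out near 417 Ω, and about 1.4 V remains below level 4, which is why a fifth level does not fit within the 4.5 V supply.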
Design Description
A disadvantage of CML is its high level of DC power consumption. The power consumption is directly proportional to the number of trees used. In CML designs, there is a constant current flowing in all the trees, so there is always a constant power level even if a tree is not being used. This is why power management techniques play an important role in all CML designs.
There are two general approaches to reducing power consumption. The first is to use smart Computer Aided Design (CAD) software that generates a configuration stream to turn off parts of the FPGA (such as I/O devices and unused cells). The second is to reduce the speed of operation dynamically and thus reduce the power. The following discussion presents the hardware design of the power management strategies.
The logic cell is designed to have multiple power states: Fast, Non-Critical, Slow and Off. Before the logic cell is configured to operate in one of the four states, it should be determined whether that particular cell is used or not. If it is to be used, the cell is put into one of the three "ON" states. When the cell is not required to operate at high speed, it can be configured to the Non-Critical mode, which reduces the power consumption. When the cell is only used at low speed, it can be configured to the Slow mode, which reduces the power consumption even further. Finally, when a cell is "OFF", it does nothing and consumes no power. The CAD software manages the state configurations during programming. The trade-off is that the CAD software becomes more complicated and may need more time for compiling.
Power management brings up a number of issues that the circuit designer must be aware of. How much power is saved in the Slow and Non-Critical modes? How long does it take for the cell to switch from one mode to another? Can part of the FPGA work in the Fast mode while another part works in the Slow mode? How fast can the FPGA work in each of the three modes? Of course, the CAD tools must also be extended to optimize the design for these issues.
A Widlar current mirror can be easily redesigned to output multiple reference voltages, as shown in Fig. 4. It can safely drive 15 loads without an obvious loading effect. Fig. 4 also shows a simple CML tree with three transmission gates on top of it. The current in the current mirror is 0.6 mA, 0.3 mA, 0.1 mA and 0 mA in the Fast, Non-Critical, Slow and Off modes respectively. The transmission gates control the mode in which the CML tree operates. Only one transmission gate is on at a time. The Schottky diodes are used to prevent shorts between V_CC and V_EE.
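To make the effect of the four modes concrete, the following sketch turns the quoted mirror currents into per-tree static power and relative savings. It assumes that the static power of one tree is simply the mirror current times the 4.5 V supply, and it ignores the reference branch of the mirror and any emitter followers; the names `PowerMode`, `tree_power_mw` and `saving_vs_fast` are illustrative and do not come from the paper.

```python
from enum import Enum

SUPPLY_V = 4.5  # supply magnitude quoted in the text (V)


class PowerMode(Enum):
    """Mirror current per CML tree in each mode (A), as quoted in the text."""
    FAST = 0.6e-3
    NON_CRITICAL = 0.3e-3
    SLOW = 0.1e-3
    OFF = 0.0


def tree_power_mw(mode: PowerMode) -> float:
    """Static power of one tree, assuming P = I * V (mirror overhead ignored)."""
    return mode.value * SUPPLY_V * 1e3


def saving_vs_fast(mode: PowerMode) -> float:
    """Fraction of the Fast-mode tree power saved in the given mode."""
    return 1.0 - mode.value / PowerMode.FAST.value


for mode in PowerMode:
    print(f"{mode.name:>12}: {tree_power_mw(mode):.2f} mW per tree, "
          f"{saving_vs_fast(mode):.0%} saved vs Fast")
```

With these assumptions a tree dissipates about 2.7 mW in the Fast mode, half of that in the Non-Critical mode, and roughly one sixth of it in the Slow mode.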
A main concern here is how fast the circuit can switch from one mode to another. The configuration switching speed is mainly limited by the Schottky diodes and transmission gates on top of the CML tree, because of the additional parasitic capacitance they introduce. An example of the current response is shown in Fig. 4 when the mode switches from Slow to Fast. The switching time of the current is around 37 ps, which is several orders of magnitude faster than the configuration time.
The CLB structure
A simple implementation would be to design each multiplexer separately and then join them, but the power dissipation in such an implementation would be large. The original structure has been modified to make it suitable for CML. The objective is to achieve the same logic using fewer trees in order to reduce the power and propagation delay. Fig. 6 shows the redesigned structure. This structure can be implemented in just 7 trees (3 logic trees with 2 pairs of emitter followers), whereas the original structure required 11 trees. This 36% reduction in the number of trees leads to at least a 36% reduction in power dissipation. Fig. 7 shows the schematics of all three blocks. Since the number of trees in both the combinational and sequential paths has been reduced, the delay contributed by the logic path is also reduced. With the power management circuitry added, however, the propagation delay of the new CLB is around 175 ps in the Fast mode, compared with 120 ps for the previous CLB [9] with the same number of loads. The new architecture is therefore a little slower but performs better in terms of power. The larger propagation delay arises from the parasitic capacitances introduced by the Schottky diodes, as well as from the reduction of the current in the current mirror from 0.8 mA to 0.6 mA [9]. The design is multiplexer based and drives the interconnect through emitter followers, thus achieving a much higher speed than CMOS FPGAs.
Fig. 8 shows the schematic of a complete logic cell with configuration memory. The logic cell consists of a CLB and 4 routing multiplexers. Each multiplexer can be used to route the output signal of the CLB to the nearest logic cells (North, South, East and West). It also has a special multiplexer called the Magic multiplexer, which is useful for corner turning (the other four routing resources are all straight). The previous architecture required 8:1 multiplexers for selecting the correct inputs to the CLB. These have been replaced by 9:1 multiplexers, which implement the feedback from the Master-Slave Flip-Flop to the input multiplexers. The redesigned architecture can realize all the functions of the XC6200 architecture with fewer circuit trees.
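The trade-off between tree count, current and delay can be checked with simple arithmetic. The sketch below uses only figures quoted in the text (11 versus 7 trees, 0.8 mA versus 0.6 mA, 120 ps versus 175 ps). The combined figure rests on an assumption that is not made in the paper, namely that every tree, emitter followers included, draws the full mirror current, so it should be read as a rough indication of why the 36% figure is described as a lower bound.

```python
def fractional_reduction(old: float, new: float) -> float:
    """Relative reduction when going from `old` to `new`."""
    return (old - new) / old


OLD_TREES, NEW_TREES = 11, 7
OLD_CURRENT_MA, NEW_CURRENT_MA = 0.8, 0.6   # mirror current per tree (mA)
OLD_DELAY_PS, NEW_DELAY_PS = 120, 175       # CLB propagation delay (ps)

# Saving from the tree count alone (the 36% quoted in the text).
tree_saving = fractional_reduction(OLD_TREES, NEW_TREES)

# Assumption: every tree draws the full mirror current, so the total
# static current scales as trees x current per tree.
current_saving = fractional_reduction(OLD_TREES * OLD_CURRENT_MA,
                                      NEW_TREES * NEW_CURRENT_MA)

# Price paid in speed for the power management circuitry.
delay_penalty = (NEW_DELAY_PS - OLD_DELAY_PS) / OLD_DELAY_PS

print(f"tree count saving:        {tree_saving:.1%}")
print(f"tree x current saving:    {current_saving:.1%}")
print(f"propagation delay change: +{delay_penalty:.1%}")
```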
Configuration memory
The FPGA has 2 memory planes for the configuration data. More memory planes can be added. In Fig. 8, each memory plane contains a different set of configuration bits for the FPGA as well as the state of the latch in the CLB. It configures the state of the routing multiplexers and the function of the CLB. The CLB can change functionality by loading a different set of configuration bits, and the switching can be done dynamically. Each memory plane has 52 bits to program the logic cell (18 bits for the routing multiplexers, 7 bits for the CLB functionality and 27 bits for the 9:1 multiplexers). One part of the FPGA can work in one mode on one application while another part is configured to work in another mode on another application.
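As an illustration of the 52-bit plane layout, the sketch below packs the three field groups into one word and switches between two planes. The field widths (18, 7 and 27 bits) are taken from the text; the ordering of the fields inside the word, the class names `ConfigPlane` and `LogicCellConfig`, and the example bit patterns are all hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass

ROUTING_BITS = 18     # routing multiplexers
CLB_BITS = 7          # CLB functionality
INPUT_MUX_BITS = 27   # 9:1 input multiplexers
PLANE_BITS = ROUTING_BITS + CLB_BITS + INPUT_MUX_BITS  # 52 bits per plane


def _check(value: int, width: int, name: str) -> None:
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name} does not fit in {width} bits")


@dataclass(frozen=True)
class ConfigPlane:
    """One memory plane; the field order within the word is an assumption."""
    routing: int
    clb: int
    input_mux: int

    def pack(self) -> int:
        _check(self.routing, ROUTING_BITS, "routing")
        _check(self.clb, CLB_BITS, "clb")
        _check(self.input_mux, INPUT_MUX_BITS, "input_mux")
        return ((self.routing << (CLB_BITS + INPUT_MUX_BITS))
                | (self.clb << INPUT_MUX_BITS)
                | self.input_mux)

    @classmethod
    def unpack(cls, word: int) -> "ConfigPlane":
        _check(word, PLANE_BITS, "word")
        return cls(
            routing=word >> (CLB_BITS + INPUT_MUX_BITS),
            clb=(word >> INPUT_MUX_BITS) & ((1 << CLB_BITS) - 1),
            input_mux=word & ((1 << INPUT_MUX_BITS) - 1),
        )


class LogicCellConfig:
    """A logic cell with several planes, one of which is active at a time."""

    def __init__(self, planes: list[ConfigPlane]) -> None:
        self.planes = planes
        self.active = 0

    def switch_plane(self, index: int) -> ConfigPlane:
        if not 0 <= index < len(self.planes):
            raise IndexError("no such memory plane")
        self.active = index
        return self.planes[index]


# Arbitrary example bit patterns for two planes.
cell = LogicCellConfig([ConfigPlane(0x2AAAA, 0x55, 0x1234567),
                        ConfigPlane(1, 2, 3)])
word = cell.planes[0].pack()
print(f"plane 0 packs into {PLANE_BITS} bits: {word:013x}")
```

Switching the active plane with `cell.switch_plane(1)` models the dynamic change of CLB functionality described above; adding more planes only requires a longer list.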
X Pattern Decoding
A CAD software utility generates the binary data to configure the FPGA. This configuration is stored in the memory, which in turn makes every cell behave as desired. Unless an efficient decoding scheme is in place, programming may require many long address and data lines, which increase congestion. Fig. 9 shows a new decoding scheme which is more symmetric and has shared address and data lines. For the 4 × 4 FPGA design in Fig. 9(a), there is one main decoder and four sub-decoders. When the global enable line is set, the main decoder enables one of the sub-decoders based on the least significant 2 bits of the control signals. The enabled sub-decoder in turn enables one of the cell decoders based on the most significant 2 bits of the control signals. The cell whose decoder is enabled is programmed by the data arriving on the address/data lines, as shown in Fig. 9(b). There are four address/data lines reaching each cell decoder. The address of the memory location into which the data has to be written is sent first over the address/data bus. The address is registered on the rising edge of the enable signal. Each decoded address enables 4 SRAMs simultaneously. Next, the data to be written into the 4 enabled SRAMs is sent over the shared bus. By this method, it is possible to write into 2^4 × 4 = 64 memory locations using only 4 lines. A straightforward decoding scheme would require 6 address lines and 1 data line. This reduction is significant because all the address/data lines go through all the logic cells in the FPGA. Moreover, the normal decoding scheme would require a 6-to-64 decoder for each cell. In addition, there would be 64 lines going into the memory, which would make the layout denser. The decoding logic has been implemented in CMOS to save power since speed is not critical here.
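The two-phase write over the shared lines can be modelled behaviourally as follows. This is a minimal sketch of the scheme as described in the text for a 4 × 4 array, not the actual CMOS decoder: the 4-bit width of the control word, the mapping of data bits onto the four enabled SRAMs, and the names `select_cell`, `Cell` and `Fpga4x4` are assumptions introduced for illustration.

```python
ADDRESS_DATA_LINES = 4   # shared address/data lines reaching each cell decoder
SRAMS_PER_ADDRESS = 4    # SRAMs enabled by one decoded address
LOCATIONS_PER_CELL = (2 ** ADDRESS_DATA_LINES) * SRAMS_PER_ADDRESS  # 64


def select_cell(control: int) -> tuple[int, int]:
    """Main decoder uses the 2 LSBs, the sub-decoder uses the 2 MSBs."""
    sub_decoder = control & 0b11
    cell_decoder = (control >> 2) & 0b11
    return sub_decoder, cell_decoder


class Cell:
    def __init__(self) -> None:
        self.sram = [0] * LOCATIONS_PER_CELL
        self.latched_address = 0

    def latch_address(self, lines: int) -> None:
        """Phase 1: address registered on the rising edge of enable."""
        self.latched_address = lines & 0xF

    def write_data(self, lines: int) -> None:
        """Phase 2: the same 4 lines now carry data for the 4 enabled SRAMs."""
        base = self.latched_address * SRAMS_PER_ADDRESS
        for bit in range(SRAMS_PER_ADDRESS):
            # Bit-to-SRAM ordering is an assumption (LSB first).
            self.sram[base + bit] = (lines >> bit) & 1


class Fpga4x4:
    def __init__(self) -> None:
        self.cells = {(s, c): Cell() for s in range(4) for c in range(4)}

    def program(self, control: int, address: int, data: int,
                global_enable: bool = True) -> None:
        if not global_enable:
            return  # no decoder is enabled, nothing is written
        cell = self.cells[select_cell(control)]
        cell.latch_address(address)
        cell.write_data(data)


fpga = Fpga4x4()
fpga.program(control=0b1001, address=0b0011, data=0b1010)
target = fpga.cells[select_cell(0b1001)]
print(select_cell(0b1001), target.sram[12:16])

shared_lines = ADDRESS_DATA_LINES   # 4 lines with the shared scheme
separate_lines = 6 + 1              # 6 address + 1 data line otherwise
```

The final two values restate the comparison in the text: 4 shared lines address the same 64 locations that would otherwise need 7 dedicated lines per cell.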
Measured results
A main concern for the new CLB is how fast it can run. In order to demonstrate the feasibility of the proposed circuits, a four-stage ring oscillator (RO) was designed to measure the propagation delay of the CLB in the Fast mode, and three-stage and two-stage ROs were built for the Non-Critical and Slow modes respectively. Fig. 10 shows the testing mechanism. The CLB is programmed to be an inverter by writing the correct bits into the configuration memory. Several CLBs are connected end to end to make an RO. The 50-Ω terminated pad driver outputs the signal to the oscilloscope. It should be noted that all the outputs of the CLB are differential, hence the RO can consist of any number of CLBs. These three ROs have been fabricated in IBM's 0.5 µm three-metal-layer SiGe BiCMOS 5HP technology [11]. The switches in the Widlar current mirror are fixed to facilitate the testing. Fig. 11 shows a die photograph of the prototypes. The CLB test chip was tested as bare die on a Techtronix probe station using two GGB Picoprobe Multi-Contact Wedge probes. Each probe contained two sets of power and ground pins and six signal pins. Test chip output signals were measured using a Tektronix 11801C digital sampling oscilloscope with an SD32 sampling head through 50-Ω cables. The periods of the output waveforms are determined only by the internal test chip circuits, hence the exclusion of packaging parasitics as a result of testing bare die is not an important factor in the measurements.
All the circuits have been tested with a supply voltage of 4.5 V. Fig. 12 shows the output waveforms from the test chip. The performance of the major parts of the SiGe FPGA is summarized in Table 1. The table shows that this is a very high-speed FPGA with a significantly reduced propagation delay in the Fast mode. According to the measurements, the operating frequency of the new CLB in the Fast mode is 5.7 GHz. The main consideration for a bipolar FPGA is power dissipation. By using the new architectures discussed earlier, at least 36% of the power can be conserved in the Fast mode. If the FPGA runs in the Slow mode, it can save even more power (at least 83%). The simulated propagation delays of the CLB in the Fast, Non-Critical and Slow modes are 171 ps, 329 ps and 641 ps, which differ from the measured results by 2.3%, 10.5% and 6.9% respectively.
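The relation between ring oscillator period, stage delay and operating frequency can be sketched as follows. The period formula T = 2 × N × t_pd is the textbook ring oscillator relation and ignores the pad driver and wiring; the paper does not state how the 5.7 GHz figure was derived, so treating it as 1/t_pd with the 175 ps Fast-mode delay quoted earlier is an assumption that merely happens to be consistent with the published numbers. The measured periods themselves are in Table 1 and Fig. 12 and are not reproduced here.

```python
def ring_period_ps(stage_delay_ps: float, stages: int) -> float:
    """Textbook ring oscillator period: the edge travels the ring twice."""
    return 2 * stages * stage_delay_ps


def stage_delay_from_period_ps(period_ps: float, stages: int) -> float:
    """Invert the relation to extract the per-stage delay from a measurement."""
    return period_ps / (2 * stages)


def toggle_rate_ghz(stage_delay_ps: float) -> float:
    """Assumption: operating frequency taken as 1 / t_pd."""
    return 1e3 / stage_delay_ps


def relative_difference(measured: float, simulated: float) -> float:
    return abs(measured - simulated) / simulated


FAST_DELAY_PS = 175.0      # Fast-mode CLB delay quoted in the design description
FAST_SIMULATED_PS = 171.0  # simulated Fast-mode delay quoted in the results
FAST_STAGES = 4            # four-stage RO used for the Fast mode

period = ring_period_ps(FAST_DELAY_PS, FAST_STAGES)
print(f"expected RO period: {period:.0f} ps")
print(f"stage delay back from period: "
      f"{stage_delay_from_period_ps(period, FAST_STAGES):.0f} ps")
print(f"1 / t_pd: {toggle_rate_ghz(FAST_DELAY_PS):.2f} GHz")
print(f"175 ps vs 171 ps: "
      f"{relative_difference(FAST_DELAY_PS, FAST_SIMULATED_PS):.1%}")
```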
Conclusion
Low power and high speed are perennial goals for circuit designers. SiGe is an obvious solution that combines low-power CMOS and high-speed bipolar devices. However, in order to scale up the FPGA significantly, a serious power management scheme must be in place. This paper presented several ideas, including a novel power control scheme, X pattern decoding and architectural changes. The multiple power states allow the CLBs on the critical path to run in the Fast mode while other CLBs are configured to operate in the Non-Critical, Slow or Off mode without jeopardizing the throughput. The FPGA design can also be made more efficient, with a smaller layout, by using the new decoding logic and reduced circuitry. Together, these techniques make gigahertz FPGAs of reasonable size viable and applicable to high-speed digital applications. There are many other ideas yet to be implemented and tested as part of this research effort, such as improving the FPGA architecture, reducing the tree height and using faster BiCMOS technologies as they become available.
