This paper presents a novel architecture for on-chip user storage in an FPGA. Shallow, wide memories are provided by allowing the user to write and read the configuration memory in unused switch blocks. Implementing this technique on a 100 x 100 logic block FPGA with 128 tracks per channel gives a total of 9.46 Megabits of memory, which can be accessed using an arbitrary word width. The circuitry to provide access to the configuration bits in this way consists of 3.58 transistors per bit, and results in a routing speed degradation of 11.6% (typical critical path degradation of 5%).
Introduction
On-chip user storage has become an essential component of today's high-density FPGAs. FPGAs are often used to implement large systems, and these systems require high-speed buffers, tables, and other memory functions. Typically, vendors support these requirements by embedding large RAM blocks [1, 2, 3, 4]. The RAM blocks provide a very dense implementation of storage, but they are not well suited to the implementation of wide, shallow, distributed memories, which are essential to many switching and DSP applications. These RAMs typically provide between 8 and 32 bits per access, meaning wide memories must be constructed by cascading several blocks. In addition, the RAM blocks in current devices are embedded at a fixed location on the chip, with all the addressing and data signals routed to and from that point.
One method for supporting wide, shallow memories is employed by Xilinx and Lucent [1, 2]. In their devices, the 16-bit lookup tables within each logic element can be configured as a RAM. Since each logic block can be configured as RAM individually, the memories can be placed close to the attached logic, and several logic blocks, each configured as a RAM, can be combined to implement wider structures. Unfortunately, the amount of storage is limited to 16 bits per lookup table, which may not be sufficient for some applications.
In this paper, we present an alternative implementation of on-chip memory.
In our architecture we re-use the configuration memory within each switch block, providing the user with access to this memory through additional circuitry. Normally, this memory is used to store the connection patterns required to implement the user circuit. Typically, however, not all the switch blocks in an FPGA are used to transport signals. We show that, by adding only a modest amount of circuitry, the configuration memory in these unused switch blocks can be used to implement wide, shallow buffers and other similar memory structures.
Utilizing unused switch blocks in this way has a number of advantages. The amount of storage within each switch block is significant; an FPGA with 128 tracks per channel contains 1088 configuration bits within each switch block. As we will show, the structure of the switch blocks leads to a natural implementation of wide, shallow memories. In addition, since the switch blocks are evenly distributed across the FPGA, the memories created from their configuration bits are not restricted to fixed locations on the device. In order to realize these advantages, however, it is important that the additional circuitry does not slow the switch block unacceptably, or result in a significant area overhead.
This paper is organized as follows: Section 2 outlines the baseline FPGA architecture. Section 3 then presents the enhanced switch block design, and proposes a method of control. Section 4 presents an analysis of the size and number of memories that can be created, along with the speed and area impact of the proposed enhancements. Finally, Section 5 discusses the implications of this new architecture on the CAD tools.
Architectural Framework
In this paper, we focus on island-style FPGAs [5]. Each horizontal and vertical channel consists of 128 parallel tracks, each track spanning four clusters (although the concept can be applied to FPGAs of any channel width and segment length). At the intersection of each horizontal and vertical channel is a Wilton switch block, shown in Fig. 1 (for clarity, the figure shows only six tracks per channel; the extension to 128 tracks per channel is straightforward). We assume that the tracks within each channel are staggered so that one quarter of the tracks start or terminate at each switch block. Tracks that start or terminate at a switch block can be connected to one track in each of the other three incident channels, while the tracks that pass through a switch block can be connected to two tracks in each of the two perpendicular incident channels.
Each programmable connection within the switch block is a bi-directional re-powering programmable switch, containing buffers and two configuration bits, as shown in Fig. 3. The value stored in each bit is set during configuration; once the chip is configured, the values in these bits do not change.
A. Memory Organization and Access Modes
In the enhanced architecture, each switch block can optionally be used as a memory with 8 words of up to 124 bits each. Rather than adding new memory bits to the switch block (as in [7]), our approach is to provide circuitry that allows the user to write and read the configuration memory corresponding to all diagonal connections within the switch block. Assuming 128 tracks per channel, there are 512 diagonal programmable connections within each switch block. We group these programmable connections into 128 groups of 4 connections each; each group implements a single bit position of the memory. Since each connection contains two configuration bits, each group contains 8 bits of storage. By providing appropriate select circuitry, any one of these 8 bits can be read or written; thus, the switch block acts as an 8 x 128 bit memory. As described below, to allow for address and control lines, our architecture can use only up to 124 of the 128 groups; thus, the maximum memory size we can support in a single switch block is 8 x 124.
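The bit counts above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the proposed design: the constant and function names are illustrative, and the count of four reserved groups assumes one group is given up for each of the three address lines and the read/write line.

```python
# Bookkeeping for one switch block, using the figures quoted in the text.
DIAGONAL_CONNECTIONS = 512   # diagonal connections at 128 tracks per channel
CONNECTIONS_PER_GROUP = 4    # each group implements one bit position
BITS_PER_CONNECTION = 2      # two configuration bits per connection
RESERVED_GROUPS = 4          # assumed: 3 address lines + 1 read/write line


def switch_block_memory_shape():
    """Return (depth_in_words, usable_width_in_bits) for one switch block."""
    groups = DIAGONAL_CONNECTIONS // CONNECTIONS_PER_GROUP    # 128 groups
    depth = CONNECTIONS_PER_GROUP * BITS_PER_CONNECTION       # 8 bits per group
    usable_width = groups - RESERVED_GROUPS                   # 124 data bits
    return depth, usable_width


depth, width = switch_block_memory_shape()
bits_per_switch_block = depth * width
```

With the values from the text, this gives a depth of 8 words, a usable width of 124 bits, and 992 bits per switch block.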
When used as a memory, the switch block can be connected to the incident tracks in one of two modes. In the first mode, the data input for each group is taken from a track in the horizontal channel incident to the switch block, and the data output of each group is supplied to a track in the vertical channel incident to the switch block. In this mode, the address bits and write enable signal are selected from four tracks in the horizontal channel, as described in Sect. C. In the second mode, the roles of the horizontal and vertical channels are swapped; the data inputs, address bits, and write enable signal are taken from tracks in the vertical channels, and the data output bits are supplied to tracks in the horizontal channels. In both modes, the non-diagonal connections are closed, meaning that connections to the horizontal channels can come from either the right or the left side, and connections to the vertical channels can come from above or below.
B. Enhanced Programmable Bi-Directional Connection
The key to our architecture is an enhanced programmable connection element (EPC). As Fig. 4 shows, the EPC provides access to its two configuration memory bits by linking each configuration memory cell to a horizontal or vertical track through pass transistor circuitry. Each of the two configuration bits in the EPC is connected to an intermediate staging point through pass transistors M3 and M4. These transistors are controlled by lines CA and CB, the values of which are generated by the address control circuitry, as described in Sect. C. All four connections within a group share the same staging point. From this intermediate staging point, pass transistors M5 and M6 connect to a vertical and a horizontal track respectively; these transistors are controlled by the MemVert and MemHorz lines (discussed in Sect. C), which allow access to the memory from either the vertical or horizontal direction.
Additional pass transistors M7 and M8 are used to isolate the two sides of the programmable connection when the switch block is used as a memory. When not used as a memory, transistors M7 and M8 are closed, allowing the circuit to operate as a normal programmable connection. 
C. Memory Addressing and Control
The eight-word memory requires three address bits, as well as a read/write control signal. Rather than tying these inputs to four specific incident tracks, the address and control inputs are chosen from the entire set of incident tracks in either the horizontal or vertical channel, as shown in Fig. 5. Each address and control line can be selected from one quarter of the incident tracks; those tracks chosen to supply address and control information are, of course, not used for data. The input lines to the control circuitry are shared between the horizontal and vertical incident tracks using four two-input multiplexors (only shown for line A2 in Fig. 5). The address and control signals are fed to a 3-to-8 decoder; the eight outputs of the decoder are used to drive the CA and CB lines of one of the four programmable connections in each group.
In addition to this circuitry, two configuration bits are needed per switch block. The first configuration bit is used to indicate whether the switch block is acting as a memory or as a routing switch. This bit is connected through an inverter to M7 and M8 of the EPCs.
The second configuration bit indicates whether the data inputs are connected to the horizontal or vertical channel (the data outputs are connected to the other). It is used both to control the previously mentioned multiplexors and, together with the read/write signal, to produce the MemVert and MemHorz signals. These signals control the connection between the intermediate staging point of Fig. 4 and the channels incident to the switch block. Note that, if the switch block is used as a memory, the intermediate staging point is connected to either the horizontal or vertical channels (depending on whether a read or write is occurring) but never both simultaneously. When not used as a memory, the intermediate staging point is isolated.
Finally, transistor M2, which directly enables the writing of the memory cell, must share its control between the configuration circuitry and the read/write line. This allows the bit to be written during configuration as well as when it is written by the user.
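The addressing and control scheme of this section can be summarized with a small behavioural model. This is a sketch under stated assumptions rather than a description of the actual circuit: the names `decode_address`, `staging_enables`, and `SwitchBlockMemory` are hypothetical, and the mapping from address to (connection, CA/CB line) is one plausible ordering that the text does not specify.

```python
WORDS = 8                  # eight words, selected by three address bits
MAX_WIDTH = 124            # groups left for data after address/control lines
CONNECTIONS_PER_GROUP = 4  # four EPCs share one staging point


def decode_address(addr):
    """Model the 3-to-8 decoder: return (connection index, bit index).

    Bit index 0 stands for the CA line and 1 for the CB line.
    The ordering used here is an assumption.
    """
    if not 0 <= addr < WORDS:
        raise ValueError("address must be in 0..7")
    return addr >> 1, addr & 1


def staging_enables(data_in_horizontal, write):
    """Return (MemHorz, MemVert) for one access; exactly one is asserted.

    A write connects the staging point to the data-input channel,
    a read connects it to the other (data-output) channel.
    """
    mem_horz = (data_in_horizontal == write)
    return mem_horz, not mem_horz


class SwitchBlockMemory:
    """Behavioural model of one switch block used as an 8-word memory."""

    def __init__(self, width=MAX_WIDTH, data_in_horizontal=True):
        if not 1 <= width <= MAX_WIDTH:
            raise ValueError("width must be in 1..124")
        self.width = width
        self.data_in_horizontal = data_in_horizontal
        # cells[group][connection][bit]: two config bits per connection
        self.cells = [[[0, 0] for _ in range(CONNECTIONS_PER_GROUP)]
                      for _ in range(width)]

    def write(self, addr, word):
        conn, bit = decode_address(addr)
        for group in range(self.width):   # one group per bit position
            self.cells[group][conn][bit] = (word >> group) & 1

    def read(self, addr):
        conn, bit = decode_address(addr)
        return sum(self.cells[group][conn][bit] << group
                   for group in range(self.width))
```

In this model, every group sees the same decoded select lines, so one access reads or writes the same configuration bit in all groups at once, which is what makes the memory wide and shallow.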
Results
With the assumed architectural framework, each switch block can optionally become a 124-bit wide, 8-word deep memory, for a total of 992 bits of memory per switch block. Assuming a 100 x 100 switch block FPGA, over 9.46 Megabits of memory are made available, distributed evenly throughout the chip. The area cost is 3.58 transistors per bit, significantly lower than the 7-9 required should wide, shallow memories simply be embedded. This corresponds to a 29.6% increase in switch block area, which translates into an overall area increase of 26.4%.
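The capacity figure can be reproduced directly from the numbers in this section. The short calculation below is only a check on the arithmetic; it assumes that one Megabit is taken as 2^20 bits, which is the convention under which the quoted 9.46 figure is obtained, and the variable names are illustrative.

```python
SWITCH_BLOCKS = 100 * 100          # 100 x 100 array of switch blocks
BITS_PER_SWITCH_BLOCK = 8 * 124    # 8 words of 124 bits
TRANSISTORS_PER_BIT = 3.58         # added access circuitry, from the text

total_bits = SWITCH_BLOCKS * BITS_PER_SWITCH_BLOCK
total_mbit = total_bits / 2**20    # assumes 1 Megabit = 2**20 bits
added_transistors_per_block = TRANSISTORS_PER_BIT * BITS_PER_SWITCH_BLOCK

print(f"{total_bits} bits = {total_mbit:.2f} Mbit")
print(f"about {added_transistors_per_block:.0f} added transistors per switch block")
```

This gives 9,920,000 bits, or about 9.46 Megabits, and roughly 3551 added transistors per switch block.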
In terms of delay, there are two quantities that must be measured: the memory access time and the impact the additional circuitry has on the performance of non-memory circuits. Using HSPICE, we have measured a memory access time of 2.536 ns (not including routing into and out of the memory). To measure the impact on the performance of
the rest of the chip may need to be adjusted to support the new architecture. In addition, the interaction between placement and routing must be examined; although the switch blocks are part of the routing, if they are to be implemented as a memory they clearly should be placed at the same time as logic blocks. Unlike existing simulated-annealing-based placers [5], the location of switch blocks, as well as the location of logic blocks, must now be swappable.
Because switch blocks that are implemented as memories are not routable, the cost function for placement of these switch block memories will need to favour locations with little or no congestion. Finding solutions to these problems is part of an on-going research effort.
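As a purely illustrative sketch of the kind of cost term described above, and not a result of this work, a placer could add a congestion penalty for every switch block assigned to a memory. The function name, its parameters, and the form of the penalty are all hypothetical.

```python
def memory_placement_cost(base_cost, memory_sites, congestion, penalty_weight=1.0):
    """Placement cost with a penalty for memories in congested regions.

    base_cost      -- conventional placement cost (e.g. a wirelength estimate)
    memory_sites   -- (x, y) coordinates of switch blocks used as memories
    congestion     -- dict mapping (x, y) to an estimated routing demand
    penalty_weight -- hypothetical tuning parameter
    """
    # A switch block used as a memory cannot route signals, so placing it
    # where routing demand is high is penalized.
    penalty = sum(congestion.get(site, 0.0) for site in memory_sites)
    return base_cost + penalty_weight * penalty


congestion = {(0, 0): 0.5, (3, 4): 0.0, (7, 2): 0.25}
quiet = memory_placement_cost(10.0, [(3, 4)], congestion, penalty_weight=2.0)
busy = memory_placement_cost(10.0, [(0, 0), (7, 2)], congestion, penalty_weight=2.0)
```

Here the placement that uses the uncongested site keeps the base cost, while the one that uses congested sites is penalized, so an annealer accepting moves on this cost would tend to move memories away from heavily routed regions.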
