Double Data Rate (DDR) SDRAMs have been prevalent in the PC memory market in recent years and are widely used in networking systems. These memory devices are developing rapidly, offering high density, high memory bandwidth and low device cost. However, because of the high-speed interface technology and complex instruction-based memory access control, a specific-purpose memory controller is necessary to optimise the memory access trade-off. In this paper, a specific-purpose DDR3 controller for high-performance table lookup is proposed and a corresponding lookup circuit based on the Hash-CAM approach is presented.
I. INTRODUCTION
With the development of network systems, packet processing techniques are becoming more important for handling the massive volume of high-throughput packets on the internet. Accordingly, advances in memory architectures are required to meet the emerging bandwidth demands. Content Addressable Memory (CAM) based techniques are widely used in network equipment for fast table lookup [1]. However, in comparison to Random Access Memory (RAM) technology, CAM technology is restricted in terms of memory density, hardware cost and power dissipation. Recently, a Hash-CAM circuit [2], which combines the merits of the hash algorithm and the CAM function, was proposed to replace pure CAM based lookup circuits with comparable performance, higher memory density and lower cost. Most importantly, off-chip high-density, low-cost DDR memory technology has now become an attractive alternative for the proposed Hash-CAM based lookup circuit. However, DDR technology is optimised for burst access on cached processor platforms. As such, efficient DDR bandwidth utilisation is a major challenge for lookup functions that exhibit short and random memory access patterns. The extremely low cost and high memory density of DDR technology allow a trade-off between memory utilisation and memory-bandwidth utilisation by customising the memory access. This, however, requires a custom-purpose DDR memory controller that is optimised to achieve the best read efficiency and highest memory bandwidth [3]. The objective of this work was to investigate advanced DDR3 SDRAM controller architectures and derive a customised architecture for the abovementioned problem.
This paper proposes an advanced DDR3 memory controller architecture optimised for a DDR3 based high-performance table lookup circuit and presents its implementation using Altera Stratix III FPGA technology.
II. RELATED WORK
Owing to the multi-bank architecture and burst write/read mode, concurrent operations on different banks are allowed in the SDRAMs. High memory bandwidth can be achieved by scheduling the memory access to each bank. Based upon this idea, many SDRAM users focus on bank control and data access sequences to achieve a better system performance.
In multimedia processing applications, Kim [4] presented an address-translation technique, which increases the memory bandwidth by 50%; Jaspers [5] mapped video data units into the memory in accordance with the statistics of actual data access results and interleaved access to the memory banks; Zhang [6] and Zhu [7] provided memory management solutions for H.264 applications, which increase bus efficiency by approximately one third. For general-purpose DDR memory controller implementations on FPGAs, the Xilinx DDR3 controller [8] keeps four banks open at a time and closes the least recently used bank when access to an unopened bank is required.
All of the above efforts assume that successive data access patterns are predictable, so that data can be stored at and retrieved from previously known address locations in the memory. However, this assumption does not hold for the random data access required in networking systems. By regrouping the addresses, Mladenov [9] is able to avoid excessive switching between rows, thus increasing the memory efficiency for random access. However, this method is highly dependent on the access pattern and the performance improvement is not guaranteed.
The presented research targets the above shortcomings by exploring specific purpose memory controller architectures that permit utilisation of the DDR3 memory bandw
III. DESIGN METHODOLOGY
DDR3 SDRAM is the 3 rd gener memories, featuring higher performa power consumption [10] . In comparis generations, DDR1/2 SDRAM, DDR3 higher density device and ach bandwidth due to the further increas rate and reduction in power consump from 1.5V power supply at 90 n technology. With 8 individual banks, is more flexible to be accessed wi conflicts.
The proposed Hash-CAM based l shown in Figure 1 .
The original data and refere information are stored in the DDR lookup request (data input) for a gi pipelined and processed by the Hash generate an address. This addr forwarded to DDR3 SDRAM Interfa translated into instructions and addr recognized by the DDR3 memory The stored data & addresses in th read back to the Hash-CAM circu validate the match. In the case of corresponding reference address is re As depicted in Figure 2 , a co SDRAM interface consists of an initia refresh control, command control, an control circuit. The proposed DDR responsible for generating D commands, addresses, data and specific control and timing order signa 
A. DDR3 SDRAM timing & ban
The most important param technology are column addre (tCAS), RAS to CAS laten precharge time (tRP) and row [10] .
The worst case row cycle tim successive random accesses to bank, where tRC = tRAS + tR value is usually higher than 2 order to overcome the random set by tRC, two successive ran in a bank must therefore be a this, data contents in the DDR duplicated into each bank and generated by the controller its each active or read command. that any two continuous read different banks of the memory. possible data allocation schem Figure 3 , where n DDR3 SD bundled together sharing the sa A number of burst data SDRAM device is used to stor data as well as the referenc address of the first burst data in is calculated from the hash valu The group of the input data (Da same hash value rest at the sam (Addr(k)) of different memory table entrants, the equivalent blocks (k) in the Hash-CAM i configuration of the DDR3 SDRA and the number (n) of memory be expressed as
where W DQ is the data width of W DATA is the input data width an of the reference address of the selected, the CAM size is th calculating the percentage data Hash-CAM, as 
B. WR/RD cycle
The command state diagram for the proposed controller is shown in Figure 4. The definitions of all timing parameters can be found in [10]. As a half clock rate is applied to the controller circuit, all timing parameters of the DDR3 memory are divided by two when calculating the equivalent number of clock cycles in the state machine. Because this controller circuit is designed for table lookup, the write operation is not taken into account for the lookup performance, as it occurs only during the table update stage. Figure 4(a) gives the normal write/read operations with the auto precharge option. The write/read cycles must satisfy the delay parameters, such as tRCD and tWL. With this approach, the next access to the same bank must wait for a period of time, known as tRC, until the currently opened row is closed. This results in an unnecessary delay. As discussed in Section III.A, fast read can be achieved by switching banks. Bank control logic is used to issue the desired bank address each time a bank active command or read command is issued. The state machine for this method is given in Figure 4(b). The proposed controller provides the control interface for switching between normal write/read mode and fast read mode.
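The half-rate conversion of timing parameters can be sketched as follows. This is a minimal illustration rather than the controller's actual logic: the function name and the example cycle counts are hypothetical, and rounding up is an assumption made here so that a converted delay is never shorter than the original constraint.

```python
def to_controller_cycles(memory_cycles: int, rate_divider: int = 2) -> int:
    """Convert a delay in memory-clock cycles to controller-clock cycles.

    rate_divider = 2 models the half-rate controller described in the text.
    Rounding up is an assumption: it keeps the converted delay from
    undercutting the original timing constraint.
    """
    if memory_cycles < 0 or rate_divider < 1:
        raise ValueError("cycle count must be >= 0 and divider >= 1")
    return -(-memory_cycles // rate_divider)  # ceiling division on integers


# Hypothetical cycle counts for illustration only (not datasheet values).
example_timings = {"tRCD": 6, "tRP": 6, "tRAS": 15}
half_rate_timings = {
    name: to_controller_cycles(cycles) for name, cycles in example_timings.items()
}
```

With these illustrative inputs, an odd value such as 15 memory cycles maps to 8 controller cycles rather than 7, which is the conservative choice.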
Unlike other data processing techniques, the distinct characteristic of the random data lookup is the uncertainty of the incoming data. In this work, address FIFOs are applied to buffer the row/column addresses separately for each read request. The "empty" flag of the row address FIFO (addr_fifo_empty) is checked in order to evaluate whether the next command is active (ACT) or read (RDA).
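One plausible reading of this FIFO-driven command selection is sketched below as a behavioural model. It is an assumption-laden simplification, not the circuit of Figure 4(b): it ignores all timing delays (such as tRCD), issues the ACT and RDA of one request back to back, and rotates the bank index after each read. Only the flag name `addr_fifo_empty` comes from the text; every other name is hypothetical.

```python
from collections import deque

NUM_BANKS = 8  # the DDR3 device described in the text has 8 banks


def issue_commands(requests):
    """Return the command list for a sequence of (row, column) read requests.

    Each request yields an ACT followed by an RDA (read with auto precharge)
    on the same bank. The bank index then advances, so two consecutive
    requests never target the same bank.
    """
    row_fifo = deque(row for row, _ in requests)  # buffered row addresses
    col_fifo = deque(col for _, col in requests)  # buffered column addresses
    commands = []
    bank = 0
    row_open = False  # True between an ACT and its matching RDA

    while row_open or row_fifo:
        addr_fifo_empty = not row_fifo
        if not row_open and not addr_fifo_empty:
            # A new request is waiting: open its row in the current bank.
            commands.append(("ACT", bank, row_fifo.popleft()))
            row_open = True
        else:
            # A row is open: read it, auto-precharge, move to the next bank.
            commands.append(("RDA", bank, col_fifo.popleft()))
            row_open = False
            bank = (bank + 1) % NUM_BANKS
    return commands
```

For example, `issue_commands([(10, 1), (20, 2)])` produces an ACT and an RDA on bank 0 followed by an ACT and an RDA on bank 1.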
C. Fast read timing
When operating in fast read mode, the input data rate is the dominant parameter that determines the command sequences. Figure 5 shows an example timing diagram of the controller commands output, with a selected DDR3 memory model.
Under such conditions, this half-rate controller is ready to accept a read request at every other clock cycle. Continuous read requests apply when the input data frequency is faster than half of the controller clock frequency, as given in Figure 5(a). In this case, the controller reaches its best performance during read operations and the DDR3 data bus is fully utilised. For discontinuous read requests, the memory bus efficiency is dependent on the incoming data frequency. An example is given in Figure 5(b).
Note that in this test circuit all 8 banks of the DDR3 memory are involved in this "ACT-RDA" cycle. In other cases, fewer banks could be selected to allow more memory space for data storage, as long as the time interval between two ACT commands on the same bank is no less than tRC.
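The bank-count trade-off can be estimated with a short calculation. The sketch below assumes a round-robin rotation in which a bank is revisited once every N requests, one request every two controller cycles as stated above, and the 200MHz controller clock of the prototype. The tRC value of 50 ns is an illustrative assumption, not a figure taken from the datasheet, and the function name is hypothetical.

```python
import math


def min_banks(t_rc_ns: float, controller_clk_mhz: float,
              cycles_per_request: int = 2) -> int:
    """Smallest number of banks so that a bank is re-activated no sooner
    than tRC after its previous ACT, under round-robin bank rotation."""
    request_interval_ns = cycles_per_request * 1000.0 / controller_clk_mhz
    return math.ceil(t_rc_ns / request_interval_ns)


# Assumed tRC of 50 ns (illustrative), 200 MHz half-rate controller clock.
banks_needed = min_banks(t_rc_ns=50, controller_clk_mhz=200)
```

Under these assumptions a request arrives every 10 ns, so 5 banks would already satisfy tRC, while rotating over all 8 banks spaces ACT commands to the same bank 80 ns apart.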
IV. EXPERIMENTAL RESULTS AND ANALYSIS
In order to validate the proposed controller architecture within a data lookup application, a complete test circuit comprising the Hash-CAM block, Altera's PHY megacore and Micron's DDR3 memory model (MT41J128M8) is used. The prototype is designed to work at the half-rate frequency of 200MHz and the Hash-CAM sub-block is designed to work at quarter rate, at 100MHz. The functional simulation results are shown in Figure 6. The input data (S_data_in) is clocked at 100MHz. Each data set read back from the DDR3 (R0, R0') comprises an address field and a data field. The R0 and R0' data fields are compared in parallel with the input data (S_data_in). The address field of the matching data set is asserted as the final address value (Addr_out). The total lookup latency for each request is 15 clock cycles at 100MHz. Because the system is fully pipelined, after this initial latency of 15 clock cycles one address output is produced at every clock cycle.
The complete lookup circuit was synthesised using Altera's Stratix III technology and tested with a 64-bit wide DDR3 module. Post-layout synthesis results are shown in Table I. The estimated on-chip power dissipation in this work is 4513.79mW. The above experimental results validate the expected performance of the proposed custom-purpose DDR3 controller architecture. For random lookups, the peak DDR3 memory bandwidth can be achieved. Although the available memory space is one-eighth of the entire memory storage capacity, because the table contents are duplicated into each bank, the lookup performance was improved by at least 10 times in comparison to the Hash-CAM circuit presented in [2].
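The latency and throughput figures above follow from standard pipeline arithmetic, sketched here. The function names are hypothetical, and the model assumes an ideal pipeline with no stalls.

```python
def latency_ns(latency_cycles: int, clock_mhz: float) -> float:
    """Time from a lookup request to its address output."""
    return latency_cycles * 1000.0 / clock_mhz


def completion_time_ns(num_lookups: int, latency_cycles: int,
                       clock_mhz: float) -> float:
    """Time to finish a back-to-back stream of lookups in a full pipeline:
    the first result appears after the latency, then one result per cycle."""
    if num_lookups < 1:
        return 0.0
    total_cycles = latency_cycles + (num_lookups - 1)
    return total_cycles * 1000.0 / clock_mhz


first_result_ns = latency_ns(15, 100)               # 15 cycles at 100 MHz
thousand_lookups_ns = completion_time_ns(1000, 15, 100)
```

With the values from the text, the first result arrives after 150 ns, and a stream of 1000 back-to-back lookups completes in 1014 cycles, which is why the sustained rate approaches one lookup per 10 ns clock period.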
V. CONCLUSIONS
In this paper, an advanced DDR3 memory controller architecture for high-performance table lookup is proposed and its deployment within a high-performance Hash-CAM based lookup circuit is presented. The design study has shown that high-performance and large lookup table circuits can be implemented using low-cost state-of-the-art FPGA and DDR3 technology. The proposed DDR3 Hash-CAM circuit is prototyped for a 128K-entry table and verified for a 2Gbyte DDR3 address space. Synthesis results presented in Table I show that a 104-bit wide, 512-entry CAM circuit can be built on standard FPGA devices at an operating frequency of 100MHz. Considering such an embedded CAM sub-circuit and a 2Gbyte DDR3 module, the proposed DDR3 Hash-CAM lookup architecture is capable of supporting a 104-bit wide, 1M-entry TCP/IP header lookup table with a sustainable lookup performance of up to 100 million packets per second. Assuming a worst-case smallest packet size of 64bytes, the proposed lookup circuit is suitable for router/switch ports for address lookup or packet classification at sustainable line rates above 50Gbit/s.
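The line-rate figure can be checked with a short calculation. The sketch below counts only the packet payload size given in the text; ignoring per-packet overhead such as preamble and inter-frame gap is an assumption of this example, and the function name is hypothetical.

```python
def line_rate_gbps(packets_per_second: int, packet_bytes: int) -> float:
    """Line rate in Gbit/s sustained at a given packet rate and packet size."""
    bits_per_second = packets_per_second * packet_bytes * 8
    return bits_per_second / 1e9


# 100 million lookups per second, 64-byte worst-case packets (from the text).
rate = line_rate_gbps(100_000_000, 64)
```

One lookup per packet at 100 million packets per second with 64-byte packets gives 51.2 Gbit/s, consistent with the stated figure of above 50Gbit/s.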
