A new mapping algorithm for systolic array based IP Lookup architectures by Yalinkaya, Serhat
 
 
  
 
 
 
 
A NEW MAPPING ALGORITHM  
FOR  
SYSTOLIC ARRAY BASED IP LOOKUP 
ARCHITECTURES 
 
 
 
Author: SERHAT YALINKAYA 
Advisor: PERE BARLET-ROS 
 
 
 
 
 
THIS DISSERTATION IS SUBMITTED  
FOR THE DEGREE OF  
 
MASTER IN INNOVATION AND RESEARCH IN INFORMATICS: 
HIGH PERFORMANCE COMPUTING 
IN  
INFORMATICS ENGINEERING 
 
 
 
 
 
 
 
FEBRUARY 2018 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
To my family 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
  
 
 
ACKNOWLEDGMENTS 
 
 
 I wish to express my sincere gratitude to my supervisor Assoc. Prof. Dr. Pere 
Barlet-Ros for his guidance, advice and encouragement throughout the research. 
 
 I would like to thank Assoc. Prof. Dr. Cuneyt F. Bazlamacci for his 
encouragement, comments and suggestions. 
 
 I must also express my deepest gratitude to my family for their unwavering 
support, continual confidence and endless love. I could not have reached this point 
without the support of my parents.  
 
 
 
 
 
 
 
 
  
 
 
  
 
 
ABSTRACT 
 
 
 
A NEW MAPPING ALGORITHM FOR SYSTOLIC ARRAY BASED  
IP LOOKUP ARCHITECTURES 
 
 
Yalinkaya, Serhat 
M.S., Department of Informatics Engineering  
Supervisor: Assoc. Prof. Dr. Pere Barlet-Ros 
Department of Computer Architecture 
 
February 2019, 50 pages 
 
 
 
The increasing demand on the Internet requires more powerful routers. IP lookup is the 
main time-consuming operation in today’s routers. To meet this demand, routers must 
perform IP lookup at higher speed while consuming less power and less physical space. 
Several studies have approached this problem in different ways. The research 
hypothesis of this project is that the memory distribution is not balanced in the SRAM 
based systolic array architecture, which affects both the memory usage and the 
performance of the system. The main objective of this project is to find appropriate 
methods to balance the memory distribution so that the throughput increases while the 
memory requirement is reduced. The first objective is to develop a new mapping 
algorithm that balances the memory distribution so that the overall system requires less 
physical space while the throughput is either improved or not affected. We explain the 
idea of the new mapping algorithm and the necessary modifications to the architecture. 
In the simulations, we created the same environment for both the new and the previous 
mapping algorithm. The results showed a significant improvement in memory 
distribution with the new mapping algorithm. In addition, the new mapping algorithm 
allows this architecture to be scaled up to 3D, which is the second objective of this 
project. We explain why the previous algorithm is not appropriate for scaling up to 3D. 
We simulated the 3D array architecture with the new mapping algorithm in the same 
environment as before. The results showed a slight improvement in memory 
distribution and throughput. Lastly, we applied a subtrie duplication algorithm. Some 
subtries are used significantly more often than others, and this unbalanced traffic 
affects the performance of the system. In real traffic, similar packets arrive at the 
system very frequently, and these packets have to wait since the system has only one 
entrance point for each subtrie. We therefore duplicate some highly demanded subtries 
and map them into the system from another entrance point, so that these packets wait 
less before entering the system. We applied this algorithm to the 3D version of SAFIL 
(8x8x4) and ran the simulation in the same environment as before in order to compare 
it with the previous simulations. 
 
 
Keywords: IP Lookup, routers, SRAM based architectures and packet forwarding 
  
 
 
  
 
 
TABLE OF CONTENTS 
 
ABSTRACT 
1. INTRODUCTION 
1.1. BACKGROUND 
1.1.1. ROUTER ARCHITECTURE 
1.1.2. IP LOOKUP 
1.2. MOTIVATION 
1.3. CONTRIBUTIONS 
1.4. THESIS STRUCTURE 
2. IP LOOKUP APPROACHES 
2.1. SOFTWARE BASED IP LOOKUP APPROACHES 
2.2. HARDWARE (SRAM) BASED IP LOOKUP APPROACHES 
3. SYSTOLIC ARRAY ARCHITECTURE FOR FAST IP LOOKUP 
3.1. INTRODUCTION 
3.2. COMPONENTS 
3.2.1. PROCESSING ELEMENTS (PE) 
3.2.2. SELECTOR UNITS (SU) 
3.2.3. CONTENTION RESOLVER (CR) 
3.3. BINARY TRIE CONSTRUCTION 
3.4. LOOKUP PROCESS 
4. THE NEW MAPPING ALGORITHM 
4.1. INTRODUCTION 
4.2. COMPONENTS 
4.2.1. PROCESSING ELEMENTS (PE) 
4.3. MAPPING AND LOOKUP PROCESS 
4.4. SIMULATIONS 
4.4.1. SIMULATION SETUP 
4.4.2. THROUGHPUT AND DELAY 
4.4.3. MEMORY REQUIREMENT 
5. THREE-DIMENSIONAL ARRAY FOR IP LOOKUP 
5.1. 3D SAFIL WITH THE PREVIOUS MAPPING ALGORITHM 
5.1.1. ARCHITECTURE OVERVIEW 
5.1.2. DOWNSIDES 
5.3. 3D SAFIL WITH THE NEW MAPPING ALGORITHM 
5.3.1. ARCHITECTURE OVERVIEW 
5.3.2. COMPONENTS 
5.4. SIMULATIONS 
5.4.1. SIMULATION SETUP 
5.4.2. THROUGHPUT AND DELAY 
5.4.3. MEMORY REQUIREMENT 
5.5. DUPLICATION OF THE MOST USED SUBTRIES 
5.6. SIMULATION SETUP 
5.7. SIMULATION RESULTS 
6. CONCLUSION 
6.1. SUMMARY 
6.2. FUTURE WORK 
BIBLIOGRAPHY 
 
  
 
 
  
 
 
LIST OF FIGURES 
 
Figure 1.1: Router Architecture [1] 
Figure 2.1: Binary trie based on a prefix table 
Figure 2.2: The corresponding leaf-pushed version of the binary trie in Figure 2.1 
Figure 2.3: The corresponding binary trie after MIPS technique 
Figure 3.1: nxn systolic array architecture 
Figure 3.2: 4x4 SAFIL [2] 
Figure 3.3: The block diagram of a processing element [2] 
Figure 3.4: A SAFIL frame 
Figure 3.5: A SRAM line 
Figure 3.6: Initial partitioning 
Figure 3.7: zero skip clustered subtries 
Figure 3.8: Search operation example 
Figure 4.1: Re-designed PE 
Figure 4.2: re-designed SAFIL frame 
Figure 4.3: re-designed SRAM line 
Figure 4.4: Re-designed Combinational Logic Unit 
Figure 4.5: A part of SRAM memory unit 
Figure 4.6: mapping a subtrie with the new algorithm 
Figure 4.7: The node distribution of T1&T2 and T3&T4 in new mapping algorithm 
Figure 4.8: The node distribution of T1&T2 and T3&T4 in previous mapping algorithm 
Figure 5.1: Neighbor connections in 2D SAFIL 
Figure 5.2: One possible way of 3D SAFIL 
Figure 5.3: Appropriate 3D version of SAFIL 
Figure 5.4: Neighbor connections in 3D SAFIL 
Figure 5.5: Re-designed combination logic unit 
 
 
  
 
 
LIST OF TABLES 
 
Table 1.1: Forwarding table in decimal format 
Table 1.2: Forwarding table in binary format 
Table 1.3: CIDR Scheme 
Table 1.4: Address Aggregation 
Table 4.1: Throughput results 
Table 4.2: Delay results 
Table 4.3: Memory requirement comparison 
Table 5.1: Simulation results of 3D SAFIL 
Table 5.2: Memory requirement comparison 
Table 5.3: Throughput changes with subtrie duplication 
Table 5.4: Total memory changes with subtrie duplication 
Table 5.5: Memory requirement changes with subtrie duplication 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER 1  
 
 
1. INTRODUCTION 
 
 
The Internet is a system of interconnected computer networks that transmits data 
between devices. Data is moved from one router to another until it reaches the 
destination address. An Internet router forwards data packets between networks and is 
connected to multiple data lines from different networks. When a new data packet 
arrives, the router determines the destination from the Internet Protocol (IP) header and 
performs an IP lookup to forward the packet to the next hop. IP lookup is the main 
time-consuming part of packet forwarding since it requires multiple memory accesses 
for each packet.  
 
 
1.1. BACKGROUND 
 
The main role of a router is to determine the path along which data is forwarded. 
Data is routed from its source address to its final destination address through multiple 
routers by using its IP address. The forwarding table stores the next hop information 
for incoming IP packets. When a new IP packet arrives at a router, the router consults 
its forwarding table (lookup table) and forwards the packet to the next hop. This 
process is called IP lookup, and it is repeated at each router until the packet reaches the 
destination address. The next hop is determined from the destination IP address, which 
is located in the header of the IP packet.   
 
 
 
 
1.1.1. ROUTER ARCHITECTURE 
 
A router consists of a control module, data planes (input/output interfaces), a 
switch module and input/output ports, as shown in Figure 1.1. The control module, also 
known as the “control plane”, works as the brain of the entire router system. It learns 
routing information from neighbor routers. Among all the information it has learned, 
the shortest paths to each destination are installed in the Routing Information Base 
(RIB). Each data plane has a Forwarding Information Base (FIB), which contains 
optimized next hop forwarding information. The data plane, also known as the “line 
card”, performs packet forwarding. A line card has an input packet buffer, where 
received packets are kept temporarily, and an output packet buffer, where packets are 
kept before being delivered to the output port. The switch module, also known as the 
“switch fabric”, functions as a bridge between the line cards.  
 
 
Figure 1.1: Router Architecture [1] 
 
An IP router has two kinds of functions: control functions and data path 
functions. The control functions consist of routing table construction, maintenance and 
updates. The data path functions consist of IP lookup, IP packet validation, packet 
lifetime control, checksum calculation and packet classification. Correspondingly, an 
IP router has two architectural parts: the routing engine and the forwarding engine. The 
routing engine handles the control functions while the forwarding engine performs the 
data path functions. Although they work independently, they communicate with each 
other.  
 
 
 
 
1.1.2. IP LOOKUP 
 
A router performs packet forwarding and determines the next hop address. In 
order to identify the output interface, it performs a search operation in a routing table, 
which consists of IP routes. The routing table is also known as the prefix table. Only 
the destination IP address of a packet is required to find the match in the prefix table.  
 
An IP address is either 32 bits long (IPv4) or 128 bits long (IPv6). Only 32-bit 
addresses are considered in this thesis. IP addresses are represented as four decimal 
numbers separated by dots, such as 216.3.128.12. Each prefix in a prefix table is written 
in decimal as “prefix/prefix length”, such as 170.170.0.0/16, where the number after the 
slash indicates the length of the prefix. Prefixes are actually bit strings that are as long 
as the prefix length, followed by a “*”. For example, 170.170.0.0/16 is represented as 
“1010101010101010*” and covers all addresses that start with the bit string 
“1010101010101010”, which means it covers 2^16 - 2 different addresses. If all the 
bits of a prefix match the most significant bits of an IP address, the prefix is considered 
to match that IP address in IP lookup. Table 1.1 and Table 1.2 show example prefixes 
in decimal and binary format, respectively. 
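
The conversion from decimal notation to a bit string, and the matching rule itself, can be illustrated with a minimal Python sketch. The function names are chosen here for illustration only and do not come from the thesis.

```python
def address_to_bits(address: str) -> str:
    """Return the 32-bit binary string of a dotted-decimal IPv4 address."""
    return "".join(f"{int(octet):08b}" for octet in address.split("."))


def prefix_to_bits(prefix: str) -> str:
    """Return the bit string of a prefix written as 'a.b.c.d/length'."""
    address, length = prefix.split("/")
    return address_to_bits(address)[: int(length)]


def matches(prefix: str, address: str) -> bool:
    """A prefix matches when its bits equal the most significant bits of the address."""
    return address_to_bits(address).startswith(prefix_to_bits(prefix))


print(prefix_to_bits("170.170.0.0/16") + "*")
print(matches("170.170.0.0/16", "170.170.25.3"))
print(matches("170.170.0.0/16", "216.3.128.12"))
```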
 
Table 1.1: Forwarding table in decimal format 
 
 
Table 1.2: Forwarding table in binary format 
 
  
 
 
 
 
Previously, a simple class-based addressing scheme was used. The IP address 
space was divided into three classes: A, B and C. Each class has a different network 
size. Classes A, B and C have 8-, 16- and 24-bit network parts and 24-, 16- and 8-bit 
host parts, respectively, as shown in Table 1.3. This scheme does not use the IP address 
space efficiently. A class C network has at most 254 host addresses, while a class B 
network has 65534, and there is no middle-sized class between them. This causes 
inefficiency when a network address is assigned to a middle-sized organization: class C 
does not meet the demand, while class B is wasteful. To allow for a more efficient use 
of the IP address space, the Classless Inter-Domain Routing (CIDR) addressing scheme 
was introduced. 
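
The host counts quoted above follow directly from the number of host bits. A small Python sketch of the arithmetic, assuming the usual convention that the all-zeros and all-ones host parts are not assigned to hosts:

```python
def usable_hosts(host_bits: int) -> int:
    """Number of host addresses, excluding the all-zeros and all-ones host parts."""
    return 2 ** host_bits - 2


for name, network_bits in (("A", 8), ("B", 16), ("C", 24)):
    host_bits = 32 - network_bits
    print(f"Class {name}: {network_bits} network bits, {usable_hosts(host_bits)} hosts")
```

The factor of 256 between the class C and class B host counts is the gap that leaves middle-sized organizations without a suitable class.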
 
Table 1.3: CIDR Scheme 
 
 
 
CIDR has two main benefits. Firstly, prefixes may have arbitrary length, which 
lets the address space be used more efficiently. Secondly, CIDR allows arbitrary 
aggregation of networks. For example, Table 1.4 shows the network addresses from 
124.12.16/24 to 124.12.31/24. The most significant 20 bits of all the addresses in this 
range are the same. Therefore, those 16 networks can be aggregated into one network 
represented by the 20-bit prefix 124.12.16/20, provided that all of these networks are 
reachable through the same port. 
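
This aggregation can be checked with Python's standard `ipaddress` module. The sketch below only verifies the address arithmetic; it does not model the condition that all networks share the same output port.

```python
import ipaddress

# The sixteen /24 networks from 124.12.16.0/24 to 124.12.31.0/24.
networks = [ipaddress.ip_network(f"124.12.{third}.0/24") for third in range(16, 32)]

# collapse_addresses merges adjacent networks into the smallest covering set.
aggregated = list(ipaddress.collapse_addresses(networks))
print(aggregated)
```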
 
 
 
  
 
 
 
 
Table 1.4: Address Aggregation 
 
 
In CIDR, an address may match more than one entry in a lookup table because 
of prefix overlap. In this case, the router must find the most specific match in order to 
make the correct decision; this is known as longest prefix matching (LPM). LPM is 
more difficult than exact matching since the destination address does not carry any 
information about the prefix length.  
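
As a reference point, LPM can be written as a linear scan over the table. The following Python sketch is a deliberately naive illustration, not the lookup method studied in this thesis; the two table entries and port names are hypothetical.

```python
def to_bits(address: str) -> str:
    """Return the 32-bit binary string of a dotted-decimal IPv4 address."""
    return "".join(f"{int(octet):08b}" for octet in address.split("."))


def longest_prefix_match(table, address):
    """Return the next hop of the longest matching prefix, or None if nothing matches."""
    bits = to_bits(address)
    best = None
    for prefix in table:
        if bits.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return table[best] if best is not None else None


# Hypothetical overlapping entries: 170.0.0.0/8 and 170.170.0.0/16.
table = {
    "10101010": "port 1",
    "1010101010101010": "port 2",
}
print(longest_prefix_match(table, "170.170.1.1"))
```

Both entries match 170.170.1.1, and the 16-bit prefix wins because it is longer. The scan has to try every prefix length, which is why trie-based structures are used in practice.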
 
 
1.2. MOTIVATION 
 
In the SRAM based systolic array IP lookup architecture [2], the prefix table is 
mapped into a binary trie. The trie is then divided into subtries, and each node of the 
subtries is mapped onto the lookup engine for pipelining. The trie traversal is 
performed on the engine. The memory size of the engine is determined by the 
processing element that requires the maximum memory. While studying the hardware 
of the engine, I realized that the trie is not distributed evenly: one processing element 
might hold almost no nodes while another holds thousands. In addition, the previous 
algorithm requires two pointers in each SRAM line while our approach requires only 
one. These problems cause unnecessary memory consumption, since the total memory 
requirement is determined by the processing element that requires the maximum 
memory. Our new mapping algorithm provides a balanced, nearly flat node 
distribution and reduces the memory requirement without degrading performance. In 
this project, we show that the same performance can be achieved with less memory 
using the new mapping.  
 
 
 
 
1.3. CONTRIBUTIONS 
 
• We implemented simulations of the SRAM based systolic array architecture [2] 
in C++. Then, we re-designed the processing elements and applied the new 
mapping algorithm. 
• We implemented the 3D version of SAFIL in C++. Then, we applied the subtrie 
duplication algorithm. 
 
 
1.4. THESIS STRUCTURE 
 
The rest of the thesis is organized as follows. Chapter 2 presents the background 
and related work on IP lookup approaches. Chapter 3 explains the architecture of 
Systolic Array Fast IP Lookup (SAFIL). Chapter 4 discusses the new mapping 
algorithm and the corresponding simulation results. Chapter 5 presents the 3D version 
of SAFIL and the subtrie duplication algorithm. Finally, Chapter 6 summarizes our 
work. 
 
 
 
 
 
 
 
 
CHAPTER 2  
 
 
2. IP LOOKUP APPROACHES 
 
 
2.1. SOFTWARE BASED IP LOOKUP APPROACHES 
 
The basic data structure in software-based IP lookup is the binary trie. Each 
prefix is represented as a path from the root to a node. Since the structure of the binary 
trie represents the bit strings of the prefixes, there is no need to store the bit strings in 
the trie nodes. Each node has two pointers (the left child pointer and the right child 
pointer). If a binary trie node holds a prefix, the corresponding next hop information 
(port number) is also stored in the node. Figure 2.1 shows a prefix table and the 
corresponding binary trie. 
 
 
 
Figure 2.1:Binary trie based on a prefix table 
 
  
 
 
 
 
The search operation starts from the root node and continues through the other 
nodes by following the bits of the IP address. The matching result is updated at each 
node that holds a prefix. The search finishes when it reaches a leaf node or when the 
requested child node does not exist. A node is called a leaf node if it does not have any 
child node. The last matching result is selected as the longest matched prefix. For 
example, a search for an address starting with “0000” matches two prefixes (P1 and 
P3). When the search arrives at the node with port number P1, the matching result is 
updated to P1. Later, when the search arrives at the node with port number P3, the 
matching result is updated to P3. Therefore, the last matching result is the longest 
matched prefix in the binary trie. In this structure, it is easy to implement prefix 
insertion, deletion and route changes. 
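
A minimal Python sketch of this search is given below. The prefix table of Figure 2.1 is not reproduced in the text, so the prefixes used here (P1 = 0*, P2 = 11*, P3 = 000*) are assumptions chosen only so that an address starting with “0000” matches P1 and then P3; the dictionary-based node layout is likewise illustrative.

```python
def make_node(port=None):
    """A trie node: an optional port plus pointers for bit '0' (left) and bit '1' (right)."""
    return {"port": port, "0": None, "1": None}


def insert(root, prefix_bits, port):
    """Create the path for prefix_bits and store the port at its last node."""
    node = root
    for bit in prefix_bits:
        if node[bit] is None:
            node[bit] = make_node()
        node = node[bit]
    node["port"] = port


def search(root, address_bits):
    """Follow the address bits and remember the last port seen on the path."""
    node = root
    result = node["port"]
    for bit in address_bits:
        node = node[bit]
        if node is None:              # requested child does not exist
            break
        if node["port"] is not None:
            result = node["port"]     # a longer prefix overrides the earlier match
    return result


root = make_node()
for bits, port in (("0", "P1"), ("11", "P2"), ("000", "P3")):   # hypothetical table
    insert(root, bits, port)
print(search(root, "0000"))
```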
 
The leaf pushing algorithm pushes all prefix nodes down to the leaf nodes; the 
resulting trie is called a leaf-pushed binary trie [3]. In a leaf-pushed trie, every non-leaf 
node has at least one child and contains only pointers to its children, while the leaf 
nodes contain only next hop information. Figure 2.2 shows the leaf-pushed version of 
the binary trie in Figure 2.1. 
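
One common way to implement leaf pushing is a recursive pass that carries the most recent prefix down the trie. The Python sketch below follows that idea; it is not taken from [3], and the example prefixes (P1 = 0*, P2 = 11*, P3 = 000*) are hypothetical.

```python
class Node:
    def __init__(self, port=None):
        self.port = port
        self.children = [None, None]   # index 0: left child, index 1: right child


def insert(root, prefix_bits, port):
    node = root
    for bit in prefix_bits:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.port = port


def leaf_push(node, inherited=None):
    """Move ports from internal nodes down to leaves, creating leaves where needed."""
    if node.children == [None, None]:          # leaf: keep its own port or inherit one
        if node.port is None:
            node.port = inherited
        return
    if node.port is not None:                  # internal prefix node: push its port down
        inherited = node.port
        node.port = None
    for i in (0, 1):
        if node.children[i] is None:
            if inherited is not None:
                node.children[i] = Node(inherited)
        else:
            leaf_push(node.children[i], inherited)


def lookup(root, address_bits):
    """In a leaf-pushed trie the answer is the port of the last node reached."""
    node = root
    for bit in address_bits:
        nxt = node.children[int(bit)]
        if nxt is None:
            break
        node = nxt
    return node.port


root = Node()
for bits, port in (("0", "P1"), ("11", "P2"), ("000", "P3")):   # hypothetical table
    insert(root, bits, port)
leaf_push(root)
print(lookup(root, "0000"), lookup(root, "0100"))
```

After the pass, a lookup no longer needs to remember intermediate matches, since only leaves carry next hop information.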
 
  
Figure 2.2: The corresponding leaf-pushed version of the binary trie in Figure 2.1 
 
 
 
 
  
 
 
 
 
 
 
Figure 2.3: The corresponding binary trie after MIPS technique 
 
 
The binary trie can be compressed by using the MIPS technique [4]. If the left 
child node and the right child node have the same next hop value, they can be replaced 
by their parent node with that next hop value. Figure 2.3 shows the compressed version 
of the binary trie in Figure 2.2. 
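
The merging rule stated above can be sketched in a few lines of Python. This illustrates only that single rule on a leaf-pushed trie and makes no claim to cover the full technique of [4]. The representation is an assumption made for brevity: a leaf is a next hop string, an internal node is a `(left, right)` tuple, and `None` marks a missing child.

```python
def compress(trie):
    """Replace two sibling leaves with equal next hops by a single leaf at their parent."""
    if not isinstance(trie, tuple):        # a leaf (next hop string) or a missing child
        return trie
    left, right = compress(trie[0]), compress(trie[1])
    both_leaves = isinstance(left, str) and isinstance(right, str)
    if both_leaves and left == right:
        return left
    return (left, right)


print(compress((("P1", "P1"), "P2")))
print(compress((("P1", "P1"), "P1")))
```

Because children are compressed before their parent is examined, a merge at one level can enable another merge at the level above, as the second call shows.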
 
 
2.2. HARDWARE (SRAM) BASED IP LOOKUP APPROACHES 
 
In the literature [2, 5, 6, 7], different types of hardware-based solutions using 
SRAM have been proposed. IP lookup requires multiple memory accesses, and 
approaches based on a single SRAM are not suitable for this. In order to provide higher 
throughput, SRAM based pipelined solutions have been proposed. These architectures 
consist of multiple memory elements. A binary trie can be mapped onto an SRAM 
based pipelined architecture, with each part of the trie mapped onto a different stage. 
The trie is then traversed across these separate memory elements. In a one-dimensional 
pipeline architecture, there are enough memory stages that each stage is accessed at 
most once during a search. Although pipelined architectures improve the throughput, 
an ordinary mapping of the binary trie onto the stages results in an unbalanced memory 
distribution. 
 
  
 
 
 
 
Different solutions have been proposed for the memory balancing problem. In 
[5], a ring pipeline architecture is proposed. This approach divides the binary trie into 
subtries and chooses a different pipeline stage as the starting point of each subtrie in 
order to balance the pipeline. The starting stage is determined by a hash function and 
is selected so that the depth of each subtrie is less than or equal to the number of stages. 
This architecture has two different data paths: one finds the starting pipeline stage while 
the other performs the lookup operation. The throughput of this architecture is 0.5 
lookups per cycle.  
 
In [6], the ring architecture is improved with a new method called Circular, 
Adaptive and Monotonic Pipeline (CAMP). This architecture performs a direct lookup 
on the first r bits of the address in order to find the starting stage. It has multiple entry 
and exit points, which improves the throughput. Each pipeline stage has two inputs and 
one output. Access conflicts are resolved by using FIFO queues. The throughput of this 
architecture is 0.8 lookups per cycle. 
 
In [7], a multiple pipeline architecture is proposed to improve the performance. 
It introduces the Parallel Optimized Linear Pipeline (POLP), in which all pipelines are 
able to operate concurrently. Trie partitioning and subtrie-to-pipeline mapping provide 
a balanced mapping. Furthermore, a stage might contain nops (no operations) to balance 
the node distribution. The throughput of this architecture is 8 lookups per cycle. 
 
In [2], a systolic array architecture is proposed for pipelined IP lookup. Initial 
partitioning and node-to-stage mapping are used in order to obtain balanced parallel 
pipelines. In this architecture, each stage has multiple processing elements and each 
processing element is connected to its neighbor elements. The processing elements 
located on the border can be chosen as starting points. A search operation starts from 
one of these processing elements and can continue through different pipelines. 
Although the node distribution is regarded as naturally balanced because a 
two-dimensional architecture is used, it can still be improved. This is the architecture 
that we improve with our algorithm, and it is explained in detail in the next chapter. 
 
  
 
 
 
 
 
 
CHAPTER 3  
 
 
3. SYSTOLIC ARRAY ARCHITECTURE FOR FAST IP 
LOOKUP 
 
 
3.1. INTRODUCTION 
 
The SRAM based array architecture (SAFIL) has the structure of a 2D torus topology 
and is operated like a systolic array in order to benefit from multi-pipeline parallelism. 
In a systolic array, processing elements are arranged in an array structure where data 
flows synchronously between neighbors in different directions. Each processing 
element has two inputs and two outputs: data arrives from the west and north and leaves 
towards the east and south, so that each output is on the opposite side of the 
corresponding input. The elements can also be called cells or nodes. Each node is 
responsible for receive, compute and transmit tasks. Communication with the outside 
is performed by the boundary cells only. The cells pass the information to their 
neighbors after performing the needed operations on the data. Figure 3.1 shows an 
example of a systolic array architecture with nxn processing elements. Systolic arrays 
have attractive properties such as synchronization, modularity, regularity, locality, 
finite connection, parallel pipelining and modular extendibility. 
 
 
  
 
 
 
 
 
 
Figure 3.1: nxn systolic array architecture 
 
 
The topology in SAFIL is a 2D torus topology. In a torus topology, there are 
wrap-around connections between processing elements. Unlike a regular torus 
topology, in SAFIL the processing elements are not wrapped around directly; the 
wrap-around connections are between contention resolvers (CR) and processing 
elements (PE). SAFIL can be considered as an array of PEs in a 2D topology that is 
operated like a systolic array to benefit from multi-pipeline parallelism. A processing 
element is similar to a central processing unit. The operations are synchronous and 
transport triggered. Only border processing elements can communicate with the 
outside. 
 
A binary trie is turned into disjoint subtries after several processing steps. Then, 
these binary subtries are mapped onto SAFIL. The mapping starts from an input stage 
PE and continues through the array. The input stage PEs are the ones to which CRs are 
connected. Selector units (SU) are responsible for finding the input stage PEs. The root 
of a subtrie is stored in the corresponding input stage PE. A SAFIL frame is a frame 
that travels between PEs during a search operation. It includes the destination IP 
address of a packet. Each frame is constructed before the search operation starts. The 
contention resolver is needed because multiple search requests may arrive at the same 
input stage PE. The CR is responsible for choosing which frame is forwarded to the 
corresponding PE. In case of contention, only one frame is selected and the others are 
held. Each SU is connected to every CR while each CR is connected to only one PE. 
The endpoints of each row and column are connected to the corresponding CRs (torus 
topology). If a search needs to circulate, the other search requests from SUs are not 
accepted by the CR; priority belongs to the frame that is already inside the search 
operation. Figure 3.2 shows the 4x4 SAFIL architecture and its components. 
 
 
 
Figure 3.2: 4x4 SAFIL [2] 
 
 
  
 
 
 
 
3.2. COMPONENTS 
 
 
3.2.1. PROCESSING ELEMENTS (PE) 
 
The block diagram of a processing element is given in Figure 3.3. A PE is 
composed of three parts: a FIFO queue block unit, an SRAM memory unit and a 
combinational logic unit. Each PE has two input lines that come from its north and west 
neighbors and are connected to the FIFO queue. Each PE also has two output lines that 
are connected to its east and south neighbor PEs. The trie nodes are stored in the SRAM 
memory unit. Incoming frames are first stored in the queue and are issued by the PE 
when their turn comes. By checking the corresponding bit of the IP address, the 
combinational logic unit modifies the frame and routes it to the next neighbor, which 
is either to the east or to the south. In each cycle, a PE functions as follows. Firstly, 
incoming SAFIL frames are inserted into the FIFO. Then, the combinational logic unit 
modifies the frame retrieved from the FIFO using the information read from the SRAM. 
Finally, the modified frame is forwarded to either the east or the south output.  
 
 
Figure 3.3: The block diagram of a processing element [2]  
 
 
 
 
 
 
Figure 3.4: A SAFIL frame 
 
 
Figure 3.5: A SRAM line 
 
A SAFIL frame has three fields: address bits (A), SRAM index (I) and port 
number (P), as shown in Figure 3.4. The field A holds the least significant t bits of the 
IP address that is being searched. The most significant 32-t bits are used for initial 
partitioning. The field I is a pointer index into the SRAM unit. Lastly, the field P holds 
the search result. A single-bit data available signal (DAV) is used between two 
neighbor PEs, together with a (t+p+q)-bit wide data bus connection. 
 
Figure 3.5 shows a line of SRAM. Each SRAM line is (2p+q+1) bits wide. The
south index (SI) and the east index (EI) are p bits each, and the port number (PN) is
q bits. The valid bit indicates whether the current trie node is a prefix or not.
The combinational logic unit updates the P field of the SAFIL frame if the current node
is a valid node, so each frame carries the latest matched port number.
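
To make the frame and SRAM line layouts concrete, the following Python sketch models one step of an original SAFIL PE. The class and function names are illustrative assumptions, field A is modeled as a list of bits in the order they are consumed, and the direction rule (bit 1 goes east, bit 0 goes south) is the one SAFIL uses during lookup. How a search terminates is deliberately left out, since it is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SramLine:
    south_index: int  # SI, p bits
    east_index: int   # EI, p bits
    port_number: int  # PN, q bits
    valid: bool       # set when the trie node is a prefix


@dataclass
class Frame:
    address_bits: List[int]  # field A, next bit to examine first (assumed order)
    index: int               # field I, SRAM line to read in this PE
    port: Optional[int]      # field P, latest matched port number


def pe_step(sram: List[SramLine], frame: Frame) -> Tuple[str, Frame]:
    """Process one frame and return (output direction, modified frame)."""
    line = sram[frame.index]
    # A valid (prefix) node overwrites the port carried by the frame.
    port = line.port_number if line.valid else frame.port
    bit, rest = frame.address_bits[0], frame.address_bits[1:]
    if bit == 1:
        return "east", Frame(rest, line.east_index, port)
    return "south", Frame(rest, line.south_index, port)
```

A frame that reads a valid line leaves the PE carrying that line's port number; a frame that reads a non-valid line keeps whatever port it already carried.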
 
 
3.2.2. SELECTOR UNITS (SU) 
 
A selector unit is a combinational logic block that takes the destination IP address as
input and processes the initial r bits of the address to find the input stage PE. The SU
also finds the memory address of the root node of the corresponding subtrie so that the
frame can start its search operation.
 
 
 
 
 
 
 
3.2.3. CONTENTION RESOLVER (CR) 
 
Multiple search requests may arrive at the same input stage PE. The CR accepts an IP
address into the system to be searched. In case of contention, only one of the frames
is selected to be searched while the others are held.
 
 
3.3. BINARY TRIE CONSTRUCTION 
 
In this section, we explain how the binary trie is constructed so that it can be used
in IP lookup. We have the prefix tables in binary format as explained in the first
chapter. The trie is constructed based on the bits in the addresses.
 
• If the corresponding bit is “0”,
   o if there is no left child, the program creates a left child node;
   o if a left child node already exists, the program uses that node.
• If the corresponding bit is “1”,
   o if there is no right child, the program creates a right child node;
   o if a right child node already exists, the program uses that node.
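
The insertion rules above can be sketched in Python as follows. The names `TrieNode` and `insert_prefix` are illustrative assumptions, and prefixes are assumed to be given as strings of "0" and "1" characters. The last line stores the next hop in the node reached at the end of the prefix.

```python
class TrieNode:
    """A binary trie node; next_hop is None for non-prefix nodes."""

    def __init__(self):
        self.left = None
        self.right = None
        self.next_hop = None


def insert_prefix(root, prefix_bits, next_hop):
    """Walk the trie along prefix_bits, creating missing children on the way."""
    node = root
    for bit in prefix_bits:
        if bit == "0":
            if node.left is None:
                node.left = TrieNode()   # no left child yet: create it
            node = node.left             # otherwise reuse the existing one
        else:
            if node.right is None:
                node.right = TrieNode()  # no right child yet: create it
            node = node.right
    node.next_hop = next_hop
```

Inserting the same path twice reuses the existing nodes, so the order in which prefixes are processed does not change the shape of the resulting trie.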
 
Basically, the program traverses the trie to find the correct location for a prefix.
When it finds the correct node, the corresponding next hop information is stored in
that node, so each prefix node holds next hop information. After all prefixes have been
inserted into the trie, the construction is finished. However, the trie is modified
further so that it can be mapped into the system more memory efficiently. First of all,
the leaf pushing algorithm is applied, which causes the binary trie to expand 1.6 times
[8]. In leaf pushing, if a prefix node is not a leaf node, its next hop information is
pushed down to the empty leaf nodes.
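
A minimal sketch of leaf pushing, under stated assumptions: the node class and function name are illustrative, and a missing child is only created when there is a next hop to push into it. Other leaf pushing variants always complete every internal node to two children.

```python
class TrieNode:
    def __init__(self, next_hop=None):
        self.left = None
        self.right = None
        self.next_hop = next_hop


def leaf_push(node, inherited=None):
    """Move next hops of internal prefix nodes down to leaf nodes."""
    if node.left is None and node.right is None:
        # A leaf keeps its own next hop, or takes the pushed one.
        if node.next_hop is None:
            node.next_hop = inherited
        return
    if node.next_hop is not None:
        # Internal prefix node: its next hop is pushed down and cleared here.
        inherited = node.next_hop
        node.next_hop = None
    if inherited is not None:
        # Fill the empty leaf positions that will receive the pushed next hop.
        if node.left is None:
            node.left = TrieNode()
        if node.right is None:
            node.right = TrieNode()
    for child in (node.left, node.right):
        if child is not None:
            leaf_push(child, inherited)
```

After the call, only leaf nodes carry next hop information, which is the property the later trie modifications rely on.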
 
 
 
 
 
After the trie is expanded by leaf pushing, it is compressed by eliminating
redundant prefixes. If both the left and right child nodes have the same next hop
information, they are combined and moved to their parent node, which reduces the
number of nodes in the trie. This process continues upwards recursively until a trie
node with different next hop information is seen. As a result, no prefix node has a
left or right child node.
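
The compression step can be sketched as a post-order traversal, so that merges propagate upwards on their own. This is an illustrative sketch; the names are assumptions, and two children are merged only when both are leaves carrying the same next hop.

```python
class TrieNode:
    def __init__(self, next_hop=None):
        self.left = None
        self.right = None
        self.next_hop = next_hop


def is_leaf(node):
    return node.left is None and node.right is None


def compress(node):
    """Merge sibling leaves that share a next hop into their parent."""
    if node is None:
        return
    # Children are compressed first, so a merge below can enable a merge here.
    compress(node.left)
    compress(node.right)
    left, right = node.left, node.right
    if (left is not None and right is not None
            and is_leaf(left) and is_leaf(right)
            and left.next_hop is not None
            and left.next_hop == right.next_hop):
        node.next_hop = left.next_hop
        node.left = None
        node.right = None
```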
 
After the trie is expanded and compressed, it is divided into subtries using
several initial bits so that it can be mapped into the system more memory efficiently.
This process is called initial partitioning. The number of initial bits is called the
initial stride (r), and the number of subtries is equal to 2^r. Figure 3.6 shows an
example of initial partitioning with an initial stride of 4.
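
As an illustration of initial partitioning, the sketch below groups prefixes by their first r bits, which yields at most 2^r subtries. The function name is an assumption, prefixes are assumed to be bit strings, r is assumed to be at least 1, and prefixes shorter than r bits are rejected because their handling is outside this sketch.

```python
from collections import defaultdict


def initial_partition(prefixes, r):
    """Group (bits, next_hop) pairs by the value of their first r bits."""
    subtries = defaultdict(list)
    for bits, next_hop in prefixes:
        if len(bits) < r:
            raise ValueError("prefix shorter than the initial stride")
        subtrie_id = int(bits[:r], 2)  # one of 2**r possible ids
        # Only the bits after the stride remain inside the subtrie.
        subtries[subtrie_id].append((bits[r:], next_hop))
    return dict(subtries)
```

With an initial stride of 4, the prefixes 10110 and 1011 both fall into subtrie 11 (binary 1011), with remaining bits "0" and the empty string respectively.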
 
After the trie is divided into separate subtries, the zero/one skip clustering
algorithm [9] is applied to these subtries. By skipping leading zeros or ones, a subtrie
can be divided into smaller subtries as shown in Figure 3.7, which makes the mapping
more memory efficient. During a search operation, the consecutive zero/one bits of the
IP address that follow the initial stride (r bits) are skipped. These bits are used to
find the starting stage PE. Then, the search operation starts from the most significant
of the remaining bits.
 
 
Figure 3.6: Initial partitioning  
 
 
 
 
 
 
Figure 3.7: zero skip clustered subtries 
 
 
3.4. LOOKUP PROCESS 
 
When a new search request arrives, it first waits for an available Selector Unit (SU).
The SU processes the first r bits and the consecutive zero/one bits of the IP address.
Depending on these bits, the SU decides the starting Processing Element (PE) and finds
the memory address of the corresponding subtrie root in that PE. If more than one SU
requires the same starting PE, the corresponding Contention Resolver (CR) selects one
of them and holds the others. Priority always belongs to a frame that arrives over the
wrap-around connections. While an SU is in the wait state, it keeps its request and
does not accept a new one. When a frame enters a PE, it is processed using the bits
other than the most significant r bits and is forwarded to the following stages. At a
PE, the direction in which the frame is forwarded is decided by the corresponding bit
of the IP address: the search continues east if the bit is “1” and south if the bit is
“0”, as shown in Figure 3.8.
 
  
 
 
 
 
 
 
Figure 3.8: Search operation example 
 
 
Updates are not a problem in this architecture. They include prefix insertion,
prefix deletion and route change, all of which are easy to handle in SAFIL. Prefix
insertion starts with a search operation. If the destination node does not exist or the
path is incomplete, the necessary nodes are inserted into the trie. Prefix deletion also
starts with a search operation. When the destination node is found, it is either deleted
or unmarked, depending on whether it has a child node or not. If the parents of the
destination node become unnecessary, they are deleted as well. Route change also
starts with a search operation. When the destination node is reached, only the next
hop information is updated.
 
 
  
 
 
 
 
 
 
CHAPTER 4  
 
 
4. THE NEW MAPPING ALGORITHM 
 
 
4.1. INTRODUCTION 
 
SAFIL is designed so that the left and right child nodes are mapped into different
Processing Elements (PE). Since the left and right child nodes are mapped separately,
each PE stores two pointers into the SRAM units of the next PEs. Sending the child
nodes to different PEs causes unnecessary memory usage. Moreover, SAFIL assumes that
the system is naturally memory balanced [2] because a 2D architecture is used. The
node distribution can be improved with a new algorithm that is aware of memory
balance, so that the memory distribution becomes nearly flat across the PEs. This new
algorithm also eliminates one of the two pointers in the PE’s SRAM unit by sending
both children to the same neighbor. Moreover, we take advantage of the fact that, after
leaf pushing, prefix nodes have no child nodes. Since a node is either a prefix or has
at least one child node, there is no need to reserve space in the SRAM unit for both a
port number and a pointer. The new mapping algorithm eliminates this unnecessary
memory usage. After the new mapping algorithm is applied, the memory requirement of
the system is reduced significantly while the throughput and delay are not affected.
 
  
 
 
 
 
The binary trie is implemented in the same way as before. The difference is that we
do not apply zero/one skip clustering, so that we can benefit from the leaf pushing
algorithm. The binary trie is built from the prefix table and the leaf pushing
algorithm is applied. Then, the binary trie is partitioned into subtries using initial
partitioning with a stride of 8 bits (the same as before). Finally, each subtrie is
mapped onto the IP lookup engine with the new mapping algorithm. In this chapter, we
explain the necessary modifications to the architecture and the new mapping idea.
Then, we compare the new mapping algorithm with the previous mapping algorithm
based on simulation results.
 
 
4.2. COMPONENTS 
 
Since we only apply the new mapping algorithm to the IP lookup engine, the only
component that is affected is the Processing Element (PE). The other components work
exactly as before, so there is no need to modify the Selector Units (SU) or the
Contention Resolvers (CR). In this section, we only explain the Processing Elements.
 
 
4.2.1. PROCESSING ELEMENTS (PE) 
 
The block diagram of the new processing element is given in Figure 4.1. The
new PE is composed of three parts: a FIFO queue block, an SRAM memory unit and a
combinational logic unit. Each PE has two input lines, coming from its north and west
neighbors, that are connected to the FIFO queue. Each PE also has two output lines
that are connected to its east and south neighbor PEs. The trie node information is
stored in the SRAM memory unit.
 
 
 
  
 
 
 
 
 
Figure 4.1: Re-designed PE 
 
 
Figure 4.2: re-designed SAFIL frame 
 
 
Figure 4.3: re-designed SRAM line 
 
 
A new SAFIL frame has two fields: address bits (A) and SRAM index (I), as
shown in Figure 4.2. We use the same names for easy comparison. Field A holds the
least significant t bits of the IP address being searched. The most significant 32-t
bits are used for initial partitioning, which is explained in the previous chapter.
Field I is a pointer (index) into the SRAM unit. This field does not store the least
significant bit of the pointer; that bit is appended by the combinational logic unit
using the current address bit.
 
 
 
 
Figure 4.3 shows a re-designed line of SRAM. Each SRAM line is ((p-1)+1+1+2)
bits wide. The next pointer is (p-1) bits while the port number is q bits, so the
shared Port Number & Next Pointer field needs the maximum of (p-1) and q bits. Since
p-1 is greater than q in most cases, this field is taken to be p-1 bits. The valid bit
and direction fields are one bit each, while the finish bits field is 2 bits.
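
The effect on the line width can be checked with a small calculation. The formulas are the ones given for the original SRAM line, (2p+q+1), and for the re-designed one; the concrete values p = 16 and q = 8 are assumptions chosen only for illustration.

```python
def old_line_width(p, q):
    """Original SAFIL line: south index + east index + port number + valid bit."""
    return 2 * p + q + 1


def new_line_width(p, q):
    """Re-designed line: shared pointer/port field + valid + direction + finish bits."""
    shared_field = max(p - 1, q)
    return shared_field + 1 + 1 + 2


p, q = 16, 8  # assumed example sizes, not taken from the simulations
print(old_line_width(p, q))  # 41
print(new_line_width(p, q))  # 19
```

For these assumed sizes a single line shrinks from 41 to 19 bits. The overall saving also depends on how many lines are allocated, since the new mapping reserves lines in pairs.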
 
Pointer & Port Number: This field in the SRAM memory unit stores either a pointer to
the next neighbor or the port number of a prefix node. Since we apply the leaf pushing
algorithm, all leaf nodes are prefix nodes, so there is no need to reserve space for a
next pointer in those nodes. Likewise, a node that is not a leaf node is not a prefix
node, so there is no need to reserve space for a port number in those nodes. Since a
node is either a leaf or not, we merged the pointer and port number bits in the SRAM
unit to avoid unnecessary memory usage. The combinational logic unit decides whether
this area should be used as a pointer or as a port number.
 
Valid Bit: This bit is set during mapping when a node is a prefix node. We need this
bit to tell whether the Pointer & Port Number field holds a pointer or a port number.
If the valid bit is set, the node is a prefix (leaf node): the field is used as the port
number and the search operation is finished. Otherwise, the search operation continues
and the field is used as the pointer to the next neighbor.
 
Direction Bit: This bit holds the direction in which a frame is supposed to continue.
It is set during the mapping operation. When the DIR bit is set, the frame is forwarded
to the east neighbor’s queue. When the DIR bit is reset, the frame is forwarded to the
south neighbor’s queue.
 
Finish Bits: A node might have only one child. Since the mapping operation always
reserves space for two child nodes, whether or not the parent node has two children, a
frame might access the empty place during a search operation. Letting this happen
would waste resources. To avoid this problem, the Finish Bits indicate whether the
search operation needs to continue. If the parent node has no child, which means it is
a leaf node (prefix node), the search operation is finished anyway without considering
the finish bits. If the parent node has both left and right child nodes, the search
operation continues without considering the finish bits. However, if the parent node
has only one child, either left or right, the finish bits are used to decide whether
the search operation continues or not.
 
Incoming frames are first stored in the queue as before. Because the search
operation finishes as soon as it reaches a prefix node, there is no need to reserve
space for a port number in the queue: frames do not carry that information along the
path.
 
When a frame is issued by the PE, the SRAM data is accessed first. Then, based
on the SRAM data, the combinational logic decides whether the search operation
continues and in which direction. If the valid bit is set, the search operation
terminates with the port number read from the SRAM memory unit. If the valid bit is
not set, the logic unit checks whether the search operation is trying to access a null
node, using the finish bits and the corresponding bit of the IP address. If so, the
search operation terminates with an empty port number, which means that no prefix is
found. Otherwise, the search operation continues in the direction given by the
direction bit, and the modified frame is forwarded in that direction. The
combinational logic unit is shown in detail in Figure 4.4.
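
The decision sequence of the re-designed PE can be sketched as below. All names are illustrative, and the encoding of the two finish bits is an assumption made for this sketch: bit 0 set means the left child ("0" branch) is missing, bit 1 set means the right child ("1" branch) is missing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NewSramLine:
    payload: int          # port number if valid, else (p-1)-bit next pointer
    valid: bool           # set for prefix (leaf) nodes
    direction_east: bool  # DIR bit: set -> east, reset -> south
    finish: int           # 2 finish bits, encoding assumed (see lead-in)


def new_pe_step(sram: List[NewSramLine], index: int, address_bits: List[int]):
    """Return (action, value, remaining bits) for one frame."""
    line = sram[index]
    if line.valid:
        # Prefix node reached: the search ends with the stored port number.
        return "hit", line.payload, address_bits
    bit = address_bits[0]
    if (line.finish >> bit) & 1:
        # The requested child does not exist: end with an empty port number.
        return "miss", None, address_bits
    # The address bit supplies the missing least significant pointer bit.
    next_index = (line.payload << 1) | bit
    direction = "east" if line.direction_east else "south"
    return direction, next_index, address_bits[1:]
```

In this sketch the frame that leaves the PE only needs the remaining address bits and the reconstructed index, which matches the two-field frame of Figure 4.2.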
 
 
Figure 4.4: Re-designed Combinational Logic Unit 
 
 
 
 
4.3. MAPPING AND LOOKUP PROCESS 
 
After the trie modifications, we have subtries to be mapped. Each subtrie has a
starting PE determined by its subtrie id. After the starting PE is decided, the root of
the subtrie is mapped first. Then, each child is mapped recursively. The mapping has
two key points.
 
The first one is to place the two children at the same neighbor. We allocate two
lines for the children of each parent node if the parent node has at least one child
node. The right child is located in the line immediately after the left child. We
distinguish the children by appending the corresponding IP address bit as the least
significant bit of the pointer during the search operation. For example, if the
children are located in memory lines starting from 0x0000 (0b0000000000000000), the
left child is located at 0x0000 and the right child at 0x0001. The SRAM unit stores
only the part of the pointer that is common to both lines, i.e. instead of storing
0b0000000000000000, it stores 0b000000000000000. Figure 4.5 shows an example with 4
memory lines. The first two lines hold the left and right child of one parent node,
while the third and fourth lines hold the children of another parent node. The first
and second lines have identical address bits except for the least significant one, and
the same holds for the third and fourth lines. We only store the common bits in the
SRAM memory units. During the search operation, the least significant bit comes from
the IP address.
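
The pointer arithmetic described above amounts to dropping and re-appending one bit. A short sketch, with function names that are assumptions for illustration:

```python
def stored_pointer(left_child_line):
    """Pointer kept in the parent: the child line address without its last bit."""
    if left_child_line % 2 != 0:
        raise ValueError("a child pair must start on an even line")
    return left_child_line >> 1


def child_line(pointer, address_bit):
    """Line of the left (bit 0) or right (bit 1) child during a search."""
    return (pointer << 1) | address_bit


pointer = stored_pointer(0x0000)
print(child_line(pointer, 0))  # 0, the left child at 0x0000
print(child_line(pointer, 1))  # 1, the right child at 0x0001
```

The same rule covers the second pair in Figure 4.5: a pair starting at line 2 is stored as pointer 1, which expands to lines 2 and 3.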
 
 
 
Figure 4.5: A part of SRAM memory unit 
  
 
 
 
 
This idea causes gaps in the SRAM unit. We could get rid of these gaps by
changing the mapping algorithm slightly, but doing so is not a good idea. To remove
the gaps, we would store the starting line completely instead of storing only the
common bits in the SRAM unit. Adding the corresponding IP address bit to the starting
line would then give the correct pointer to the left or right child, and we would not
need to allocate two lines when a parent node does not need them. This solves the gap
problem easily. However, it makes update operations impossible for the new mapping
algorithm. For example, if an insertion is applied after the whole mapping operation
is finished, the new node will be located far away from its sibling. There will no
longer be common bits to store in the parent node, so one more pointer will be
required in the SRAM unit. Therefore, we kept these gaps although they increase the
memory requirement of the system.
 
The second key point of the new mapping is to place the children at the
neighbor with fewer nodes. During the mapping operation, every time child nodes are
placed, the neighbor with fewer nodes is chosen so that unbalanced growth is avoided.
The algorithm always fills the processing element with fewer nodes instead of filling
a random one. Figure 4.6 shows a small part of the mapping of a subtrie onto the IP
lookup engine.
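
Both key points can be combined in a short sketch that maps one subtrie onto a torus of PEs and counts the allocated lines per PE. The names, the choice of south when both neighbors are equally loaded, and the per-call load table are assumptions of this sketch; in a full mapper the load table would be shared by all subtries.

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def map_subtrie(root, start, rows, cols):
    """Return a rows x cols table with the number of lines allocated per PE."""
    load = [[0] * cols for _ in range(rows)]
    load[start[0]][start[1]] += 1  # the subtrie root takes one line
    pending = [(root, start)]
    while pending:
        node, (r, c) = pending.pop()
        if node.left is None and node.right is None:
            continue  # leaf (prefix) node: nothing to place
        east = (r, (c + 1) % cols)   # wrap-around gives the torus topology
        south = ((r + 1) % rows, c)
        # Second key point: choose the neighbor with fewer nodes.
        if load[east[0]][east[1]] < load[south[0]][south[1]]:
            target = east
        else:
            target = south
        # First key point: both children share one neighbor, two lines reserved.
        load[target[0]][target[1]] += 2
        for child in (node.left, node.right):
            if child is not None:
                pending.append((child, target))
    return load
```

Mapping a four-node subtrie in which one parent has a single child allocates five lines in total; the extra line is the gap discussed above.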
 
 
Figure 4.6: mapping a subtrie with the new algorithm 
 
  
 
 
 
 
4.4. SIMULATIONS 
 
We implemented the same environment for the new mapping algorithm and the previous
mapping algorithm. The same prefix tables and traffic traces are used in the
simulations, as well as the same number of processing elements, selector units and
contention resolvers.
 
 
4.4.1. SIMULATION SETUP 
 
We simulated the SAFIL architecture with both the previous algorithm and the new
algorithm so that we can compare the results. We used a 16x16 SAFIL with 32 Selector
Units (SU). The FIFO size is chosen as 30. Both simulations have a cache option with a
size of 50 lines. Each cache line holds the information for only one IP address.
 
We have two real routing tables and one real traffic trace. The routing tables
have a size of 200k prefixes. The real traffic trace has a size of 500k packets. We
followed two methods for generating simulation data. Firstly, we generated synthetic
traffic traces by using an available routing table from [11]. These traces have
uniformly distributed burst lengths, where the burst length is the number of
consecutive packets having the same destination IP address. Synthetic traffic traces
are generated based on the real routing tables with burst lengths of 2 and 10. The
first method gives us 4 different experiments to be used in our simulations. Secondly,
we generated routing tables with different Gaussian prefix length distributions by
using a real traffic trace [10]. The length distributions are selected as 18, 20 and
24. The second method gives us 3 more experiments. In total, we have 7 different
experimental setups. T1 and T2 have the same routing table but different traffic traces
(burst lengths of 2 and 10, respectively). T3 and T4 have the same routing table but
different traffic traces (burst lengths of 2 and 10, respectively). T5, T6 and T7 have
the same traffic trace (real traffic) but different routing tables (length
distributions of 18, 20 and 24).
 
 
 
 
 
The leaf pushing algorithm is applied in all cases. Initial partitioning is performed
on each binary trie using the most significant 8 bits of the IP addresses. We performed
both zero and one skip clustering.
 
 
4.4.2. THROUGHPUT AND DELAY 
 
Throughput is defined as the number of lookups performed in one cycle.
Delay is defined as the time elapsed between the arrival of a packet and its departure
from the system. Table 4.1 and Table 4.2 present the throughput and the delay for the
simulated packet traces with caches. The throughput results are almost the same for
the two mapping algorithms in the simulations with synthetic traffic. With the real
traffic traces, which are more important, the new mapping algorithm has an advantage.
The delay results are always lower with the new mapping algorithm. As a result, we can
say that the new mapping algorithm provides better throughput and delay results in
addition to lower memory consumption.
 
 
Table 4.1: Throughput results 
 
 
Table 4.2: Delay results 
 
 
 
 
 
 
 
4.4.3. MEMORY REQUIREMENT 
 
Table 4.3 presents the simulation results for both algorithms. Since T1 & T2 share
the same prefix table and T3 & T4 share the same routing table, their mapping results
are the same. Due to the gaps that the new mapping algorithm leaves, the number of
nodes mapped into the system is larger than the number of available nodes. The number
of available nodes is the number of nodes in the binary trie after all operations, such
as leaf pushing, initial partitioning and zero and one skip clustering, are completed.
The gap is around 25% of the nodes: while the number of nodes is 1.6M, the new mapping
algorithm maps 2M nodes. The new mapping algorithm shows a significant improvement
over the previous mapping algorithm even though it maps 25% more nodes. The total
memory requirement is reduced by 45% with the new mapping algorithm. In Table 4.3, max
indicates the number of nodes in the PE that has the greatest number of nodes, while
min indicates the opposite. Figure 4.7 and Figure 4.8 show the node distributions of
the mapping algorithms. The memory distribution is clearly more balanced with the new
mapping algorithm than with the previous mapping algorithm.
 
 
Table 4.3: Memory requirement comparison 
 
 
 
  
 
 
 
 
 
 
 
Figure 4.7: The node distribution of T1&T2 and T3&T4 in new mapping algorithm 
 
 
 
Figure 4.8: The node distribution of T1&T2 and T3&T4 in previous mapping 
algorithm 
 
 
 
  
 
 
 
 
 
 
CHAPTER 5  
 
 
5. THREE-DIMENSIONAL ARRAY FOR IP LOOKUP 
 
 
SRAM based IP lookup solutions require multiple memory accesses. Using a pipeline
in IP lookup gives better performance than one-stage architectures. Moreover,
multi-pipeline architectures give better performance than single pipeline
architectures. Existing studies indicate that increasing the number of pipelines
results in better performance. Therefore, we propose a new approach to SRAM based IP
lookup solutions by increasing the dimension of the array architecture. The SAFIL
architecture has a 2D torus topology. In this chapter, we first present a 3D version of
the SAFIL architecture that is suitable for the previous mapping algorithm. Then, we
present the 3D version of the SAFIL architecture with the new mapping algorithm.
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
5.1. 3D SAFIL WITH THE PREVIOUS MAPPING ALGORITHM 
 
 
5.1.1. ARCHITECTURE OVERVIEW 
 
In the 2D SAFIL architecture, each processing element receives inputs from two
neighbors (west and north) and gives outputs to two neighbors (east and south). In
total, each PE has four neighbors. Figure 5.1 shows the neighbors of the black node and
the data flow in a 4x4 array. Since the array architecture provides only two directions
for a packet to follow, processing one bit is sufficient to determine the direction:
south if the bit is "0" and east if the bit is "1". When a 3D systolic array is
mentioned, the architecture that first comes to mind is the one shown in Figure 5.2.
 
 
 
Figure 5.1: Neighbor connections in 2D SAFIL 
 
 
  
 
 
 
 
 
  
Figure 5.2: One possible way of 3D SAFIL 
 
 
This architecture cannot work with the previous mapping algorithm. The idea
of pipeline parallelism is to forward the data in one direction. In 2D SAFIL, the data
generally flows from west and north to east and south, so all packets flow in these
directions. This prevents deadlock. If a frame remained in a loop between two PEs,
the frames waiting for the same PEs would suffer from that situation. Therefore, the
frames need to be forwarded in one direction. In this architecture, each PE has six
neighbors. Four of them are located on the same floor, while one is located on the
upper floor and one on the lower floor. Three of them provide input to the current PE,
while the PE gives output to the other three neighbors. Processing one bit is not
sufficient to distinguish three directions. On the other hand, processing two bits
provides four different directions, while this architecture has only three output
directions per PE. One might suggest using the up and down directions for both input
and output. For example, “00” means south, “01” means east, “10” means down and “11”
means up.
  
 
 
 
 
This scenario does not fit the pipeline idea explained above, and many
deadlocks might appear. A packet can be forwarded between an upper PE and a lower PE
indefinitely, which means the data is stuck in a couple of processing elements.
Therefore, a neighbor can be used either to receive input data or to send output data,
but not both. Since this architecture does not provide enough neighbors per processing
element, it cannot be used with an algorithm that processes IP address bits to decide
the direction. Since the previous mapping algorithm processes bits of the IP addresses
during map and search operations, this architecture is not suitable for 3D SAFIL with
the previous mapping algorithm.
 
The idea of a 3D SAFIL array is to increase the number of possible pipeline
directions. We need to process more than one bit of the IP address in a PE to be
able to distinguish more than two directions. Processing two bits provides four
different directions. Therefore, each PE needs to receive inputs from four neighbors
and give outputs to four neighbors. In total, each processing element needs to have
eight neighbors instead of four. Also, the data needs to flow through the pipeline.
 
Figure 5.3 shows the 3D array architecture that is suitable for the previous
mapping algorithm. Each floor is shown in a different color for clarity. A PE receives
inputs from four neighbors (North-Up, North-Down, West-Up and West-Down) and gives
outputs to four neighbors (East-Up, East-Down, South-Up and South-Down). In total, each
processing element has eight neighbors. We call two PEs neighbors only if they are
connected to each other, not if they are merely located next to each other. In Figure
5.3, some PEs might look like neighbors because they are located next to each other,
but they are not neighbor PEs because there is no connection between them. In this
architecture, each PE is connected only to PEs located on the upper and lower floors.
By doing so, each PE can have four output directions and process two bits of the IP
address.
 
 
  
 
 
 
 
 
 
Figure 5.3: Appropriate 3D version of SAFIL 
 
 
Figure 5.4 shows the connections that a PE has in the 3D architecture. Each
possible direction is the combination of one of two directions (east and south) with
one of the other two directions (up and down). The combinational circuit forms this
combination by processing 2 bits of the IP address: one bit decides between east and
south, and the other bit decides between up and down. In this mapping algorithm, the
directions represent the children. Therefore, each node of the trie should have 4
children instead of only 2, and the prefix table needs to be mapped into a multi-bit
trie.
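
A small sketch of the two-bit direction choice follows. The east/south rule matches 2D SAFIL (bit 1 goes east, bit 0 goes south); which bit value selects up and which selects down is an assumption of this sketch, as is the function name.

```python
def direction_3d(horizontal_bit, vertical_bit):
    """Combine two address bits into one of four output directions."""
    horizontal = "east" if horizontal_bit == 1 else "south"
    vertical = "up" if vertical_bit == 1 else "down"  # assumed encoding
    return f"{horizontal}-{vertical}"


for a in (0, 1):
    for b in (0, 1):
        print(a, b, direction_3d(a, b))
```

The four combinations correspond to the four output neighbors of Figure 5.4, and therefore to the four children of a node in a 2-bit stride trie.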
 
 
  
 
 
 
 
 
 
Figure 5.4: Neighbor connections in 3D SAFIL 
 
 
5.1.2. DOWNSIDES 
 
Having four directions means four different pointers in the SRAM unit. Also, PEs no
longer process one bit in each cycle. Therefore, both the combinational logic unit of
the PE and the indexing of the SRAM unit have to be changed. Previously, each SRAM
line was (2p+q+1) bits wide. Having two extra pointers increases the width to
(4p+q+1) bits, which almost doubles the memory requirement of the system. This large
increase in memory makes the idea impractical. To sum up, the previous mapping
algorithm is not suitable for scaling up to a 3D array.
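
The growth of the line width can be quantified with the two formulas above. The sizes p = 16 and q = 8 are assumptions used only to give a feel for the ratio.

```python
def line_width(pointers, p, q):
    """Line width with the given number of p-bit pointers, a q-bit port and a valid bit."""
    return pointers * p + q + 1


p, q = 16, 8  # assumed example sizes
width_2d = line_width(2, p, q)  # (2p+q+1)
width_3d = line_width(4, p, q)  # (4p+q+1)
print(width_2d, width_3d, round(width_3d / width_2d, 2))  # 41 73 1.78
```

Because the pointers dominate the line, the ratio approaches 2 as p grows relative to q, which is why the width is said to almost double.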
 
  
 
 
 
 
5.2. 3D SAFIL WITH THE NEW MAPPING ALGORITHM 
 
 
5.2.1. ARCHITECTURE OVERVIEW 
 
The previous mapping algorithm has two problems with a 3D array: the relation
between directions and bit processing, and the large increase in memory. Neither
problem applies to the new mapping algorithm. First of all, there is no memory
increase due to scaling up to 3D. Increasing the dimension means increasing the number
of possible pipeline directions. The new mapping algorithm places all children of a
node at the same target neighbor PE, so one pointer is sufficient no matter how many
children a parent node has. Therefore, the new mapping algorithm does not suffer from
an increase in the number of possible pipeline directions. Secondly, the new mapping
algorithm does not process IP address bits to decide the direction of a frame; the
direction is decided based on the node loads of the neighbors. There is no relation
between the pipeline directions and the IP address bits. Both 3D architectures
explained above can therefore be used without any connection problems. In addition,
the new algorithm does not require big changes in the components.
 
 
5.2.2. COMPONENTS 
 
There is no need to change the Selector Units (SU) and the Contention Resolvers (CR),
since they are identical to the previous version and work exactly as before. The only
component that needs to be changed is the Processing Element (PE). In this section, we
only explain the Processing Elements.
  
 
 
 
 
5.2.2.1. PROCESSING ELEMENTS 
 
The idea of the processing element remains the same, but small changes are
required to adapt it to the 3D array. There is no change in the FIFO queue. The Pointer
& Port Number field and the valid bit field in the SRAM unit remain the same. However,
the Direction field and the Finish Bits field are expanded if necessary. In addition,
the combinational logic unit needs to be re-designed.
 
The direction field holds the direction in which a frame needs to be forwarded.
Previously, the number of possible directions was 2, and 1 bit is sufficient to
distinguish 2 different directions. The two 3D array models have 3 and 4 possible
directions, so 2 bits in the direction field are sufficient for both. As the number of
directions increases, the number of direction bits also needs to increase. If a binary
trie is used in IP lookup, there is no need to change the Finish Bits field. The Finish
Bits field needs to be expanded only if the structure of the trie is changed.
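
The size of the direction field follows from the number of output directions as the ceiling of its base-2 logarithm. A short check, with an illustrative function name:

```python
def direction_bits(num_directions):
    """Smallest number of bits that can distinguish num_directions outputs."""
    if num_directions < 2:
        raise ValueError("a PE needs at least two output directions")
    return (num_directions - 1).bit_length()


print(direction_bits(2))  # 1 bit for 2D SAFIL
print(direction_bits(3))  # 2 bits for the six-neighbor 3D model
print(direction_bits(4))  # 2 bits for the eight-neighbor 3D model
```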
 
Figure 5.5 shows the newly designed combinational logic unit. The unit is not
modified much. The multiplexer has been changed from 1x4 to 1x8 since the number of
possible directions is increased to 4.
 
 
 
  
 
 
 
 
 
 
Figure 5.5: Re-designed combinational logic unit
 
 
5.3. SIMULATIONS 
 
 
5.3.1. SIMULATION SETUP 
 
We simulated the 3D SAFIL architecture with the new algorithm. We used an 8x8x4
SAFIL (with connections through intermediate directions). We used the same number
of PEs so that we can compare the results with the previous simulation results. The
number of Selector Units (SU) is selected as 32. The FIFO size is chosen as 30. Both
simulations have a cache option with a size of 50 lines. Each cache line holds the
information for only one IP address.
 
 
 
 
 
 
We created the environment in the same way as for the previous
simulations. We have two real routing tables (200k prefixes) and one real traffic trace
(500k packets). We followed two methods for generating simulation data. Firstly, we
generated synthetic traffic traces by using an available routing table. These traffic
traces have uniformly distributed burst lengths, where the burst length is the number
of consecutive packets having the same destination IP address. Synthetic traffic traces
are generated based on the real routing tables with burst lengths of 2 and 10. The
first method gives us 4 different experiments to be used in our simulations. T1 and T2
have the same routing table but different traffic traces (burst lengths of 2 and 10,
respectively). T3 and T4 have the same routing table but different traffic traces
(burst lengths of 2 and 10, respectively).
 
Secondly, we generated routing tables with different Gaussian prefix length
distributions by using a real traffic trace. The length distributions are selected as
18, 20 and 24. The second method gives us 3 more experiments to be used in our
simulations. T5, T6 and T7 have the same traffic trace (real traffic) but different
routing tables (length distributions of 18, 20 and 24). In total, we have 7 different
experimental setups.
 
The leaf pushing algorithm is applied in all cases. Initial partitioning is performed
on each binary trie using the most significant 8 bits of the IP addresses. We performed
both zero and one skip clustering.
 
 
5.3.2. THROUGHPUT AND DELAY 
 
Throughput is the number of lookups completed in one cycle. Delay is the time 
that elapses between the arrival of a packet and its departure from the system. 
Table 5.1 presents the throughput and the delay for the simulated packet traces with 
caches. The 2D and 3D versions of SAFIL do not differ significantly in terms of 
throughput and delay. 
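
The two metrics defined above can be computed from per-packet timestamps as in the 
following sketch. The function name throughput_and_delay and the input format (a list 
of (arrival_cycle, departure_cycle) pairs for completed lookups) are assumptions made 
for the example, and the delay is reported as an average over all packets. 

```python
def throughput_and_delay(packets, total_cycles):
    """Compute throughput (lookups per cycle) and average delay (cycles).

    packets: list of (arrival_cycle, departure_cycle) pairs, one per
    completed lookup (hypothetical log format).
    total_cycles: length of the simulation in cycles.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    throughput = len(packets) / total_cycles
    if not packets:
        return throughput, 0.0
    delays = [departure - arrival for arrival, departure in packets]
    return throughput, sum(delays) / len(delays)


if __name__ == "__main__":
    log = [(0, 5), (1, 6), (2, 8), (3, 9)]
    print(throughput_and_delay(log, total_cycles=10))
```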
 
 
 
 
 
 
 
 
Table 5.1: Simulation results of 3D SAFIL 
 
 
 
5.3.3. MEMORY REQUIREMENT 
 
Table 5.2 presents the memory results. Since T1 and T2 share the same routing 
table, as do T3 and T4, their mapping results are the same. Because of the gap that the 
new mapping algorithm introduces, the number of nodes mapped into the system is larger 
than the number of available nodes. The number of available nodes is the number of 
nodes in the binary trie after all operations (leaf pushing, initial partitioning and 
zero and one skip clustering) are completed. The gap is around 25% of the available 
nodes: while the trie has 1.6K nodes, the new mapping algorithm maps 2K nodes. As with 
throughput, there is no significant difference between the two models. The reduction 
in memory is very small and can be neglected. 
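
The size of the gap follows from the two node counts quoted above. The symbols 
below are introduced only for this worked example, and the inputs are the rounded 
figures from the text, so the result is approximate. 

```latex
\[
\text{gap}
  = \frac{N_{\text{mapped}} - N_{\text{available}}}{N_{\text{available}}}
  \approx \frac{2000 - 1600}{1600}
  = 0.25 = 25\%
\]
```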
 
Table 5.2: Memory requirement comparison 
 
 
  
 
 
 
 
5.4. DUPLICATION OF THE MOST USED SUBTRIES 
 
We use the initial bits of the IP addresses to partition the main prefix trie: 
first initial partitioning, then zeros/ones skip clustering. If two packets arrive with 
destination IPs that have the same initial bits, they must start the search operation 
from the same clustered subtrie. Therefore, one of them has to wait while the other 
enters the system. If more than two packets arrive under the same condition, all but 
one of them have to wait. We had two test methods for the simulations, one of which 
was to generate a synthetic prefix table based on real traffic. When we simulated this 
method, we realized that real traffic packets arrive in a very unbalanced way. We 
considered the ratio of IPs that share the same initial bits, and this ratio varied 
widely across initial bits: some arrive very frequently while others arrive very 
rarely. Therefore, the load concentrates heavily on some subtries while others stay 
almost idle. 
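
A simple way to observe this imbalance is to count how often each value of the 
initial bits occurs in a trace. The sketch below uses the most significant 8 bits as 
the grouping key; the function name entry_point_shares and the representation of 
addresses as 32-bit integers are assumptions made for the example, and the further 
split caused by skip clustering is ignored. 

```python
from collections import Counter


def entry_point_shares(trace, bits=8):
    """Return the share of packets per value of the top `bits` bits.

    trace: iterable of 32-bit destination addresses (integers).
    The result is ordered from the busiest to the least used group.
    """
    counts = Counter(addr >> (32 - bits) for addr in trace)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.most_common()}


if __name__ == "__main__":
    # 10.x.x.x dominates this small, made-up trace.
    trace = [0x0A000001] * 6 + [0xC0A80001] * 3 + [0x08080808]
    for key, share in entry_point_shares(trace).items():
        print(key, share)
```

A trace in which a few keys hold most of the share is the situation described above: 
packets for those keys queue at the same entrance point while other PEs stay idle. 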
 
This imbalance affects the performance of our system, so we came up with a 
solution to this problem. If we detect the most used clustered subtries and map them 
into the system again from different entrance points, as alternatives to the original 
ones, there will be more starting points available for busy clustered subtries. In 
this way, we can distribute the density of the traffic. However, mapping some subtries 
again increases the memory usage of our system. Normally, we expect at most doubled 
throughput if we double the resources of the system, and memory is the only resource 
in our case. Here, however, we increase the memory purposefully, with specially chosen 
nodes. Therefore, we expect more than doubled throughput even if we double the memory 
by adding the chosen subtries. In general, we expect the increase in throughput to be 
higher than the increase in memory. 
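
The effect of an extra entrance point can be illustrated with a deliberately 
idealized model, which is not the SAFIL simulator: all packets are assumed to be 
waiting at cycle 0, every copy of a subtrie admits one packet per cycle, and packets 
are spread evenly over the copies. The function name cycles_to_admit and the copies 
argument are hypothetical. 

```python
import math
from collections import Counter


def cycles_to_admit(subtrie_ids, copies=None):
    """Cycles needed until every waiting packet has entered the system.

    subtrie_ids: one entry per packet, naming its starting subtrie.
    copies: optional dict mapping a subtrie to its number of entrance
    points (original plus duplicates); defaults to 1 per subtrie.
    """
    copies = copies or {}
    demand = Counter(subtrie_ids)
    # Subtries drain in parallel, so the busiest one sets the total.
    return max(
        (math.ceil(n / copies.get(s, 1)) for s, n in demand.items()),
        default=0,
    )


if __name__ == "__main__":
    packets = ["hot"] * 8 + ["cold"] * 2
    print(cycles_to_admit(packets))              # one entrance point each
    print(cycles_to_admit(packets, {"hot": 2}))  # "hot" duplicated once
```

In this toy case, duplicating only the busy subtrie halves the admission time, while 
the added memory is limited to the nodes of that one subtrie. 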
  
 
 
 
 
5.5. SIMULATION SETUP 
 
We used the same environments again. We consider the samples with real traffic 
for the throughput increase and the samples with real prefix tables for the memory 
increase. The duplication of a subtrie has no limit unless we define one, so we could 
duplicate subtries as often as we want. However, we cannot duplicate indefinitely: 
after some duplications, the throughput no longer increases at the same rate as the 
memory. Therefore, we should define a limit on the number of duplications. 
 
We defined some parameters in order to find the best number of duplications: the 
total number of duplications, the duplication number of each clustered subtrie, the 
usage ratio for a clustered subtrie and the checking period. The usage ratio is the 
usage number of a subtrie in one checking period. If a subtrie is used in a checking 
period more than the usage ratio that we defined, it is duplicated. This process 
continues until a subtrie reaches its local limit (the duplication number of each 
clustered subtrie) or the total limit is reached. We found the best results for our 
test samples with these parameter values: 
 
● Total number of duplications: 20  
● Duplication number of each clustered subtrie: 2  
● Usage ratio for a clustered subtrie: 0.015  
● Checking period: 1000 cycles 
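
The selection rule described above can be sketched as follows, using the parameter 
values from the list as defaults. This is a minimal sketch, not the implementation 
used in the simulator. In particular, it assumes that the usage ratio is measured 
relative to the length of the checking period (0.015 x 1000 cycles = 15 uses), which 
the text does not state explicitly, and it does not model how a duplication changes 
the usage counts of later periods. The function name select_duplications and the 
per-period dictionaries are hypothetical. 

```python
def select_duplications(usage_per_period, total_limit=20,
                        per_subtrie_limit=2, usage_ratio=0.015,
                        check_period=1000):
    """Decide how many extra copies each clustered subtrie receives.

    usage_per_period: iterable of dicts, one per checking period,
    mapping a subtrie id to its number of uses in that period.
    Returns a dict mapping subtrie id to its number of duplications.
    """
    # Assumption: the ratio is taken relative to the period length.
    threshold = usage_ratio * check_period
    duplications = {}
    total = 0
    for usage in usage_per_period:
        # Visit the busiest subtries first within each period.
        ranked = sorted(usage.items(), key=lambda item: item[1], reverse=True)
        for subtrie, uses in ranked:
            if total >= total_limit:
                return duplications  # global limit reached
            if uses > threshold and duplications.get(subtrie, 0) < per_subtrie_limit:
                duplications[subtrie] = duplications.get(subtrie, 0) + 1
                total += 1
    return duplications


if __name__ == "__main__":
    periods = [{"a": 40, "b": 16, "c": 10}] * 3
    print(select_duplications(periods))
```

With the default values, a subtrie can be duplicated at most twice, and at most 20 
duplications are made overall, so the memory overhead stays bounded regardless of how 
skewed the traffic is. 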
 
 
 
 
 
 
 
  
 
 
 
 
5.6. SIMULATION RESULTS 
 
In the simulations, we observed that the throughput increases significantly with 
the new method. There is no change in throughput for tests T1, T2, T3 and T4 because 
the traffic used in them is synthetic. Table 5.3 shows that the tests with real traffic 
traces (T5, T6, T7) have a throughput increase of 136% to 220%. Obtaining this 
improvement in tests with real traffic traces makes the results more reliable. These 
numbers can be increased even further by changing the parameters that we defined. 
However, doing so increases memory so much that the increase in throughput is no 
longer worthwhile. 
 
 
Table 5.3: Throughput changes with subtrie duplication 
 
 
 
Total memory increases by at most 23% in the first four tests, as shown in 
Table 5.4. Since tests T5, T6 and T7 have synthetic prefix tables, it makes no sense 
to consider the memory increase in these tests. Table 5.5 shows how much the memory 
requirement of the system increases: at most 24%. Increasing throughput by at least 
136% while increasing resources by at most 24% is the most effective usage of SAFIL. 
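
The two bounds quoted above can be combined into a rough gain-per-cost figure. 
This is only indicative: it assumes that both percentages are relative increases and 
that the bounds may be combined, although the throughput figure comes from the tests 
with real traffic (T5 to T7) and the memory figure from the tests with real prefix 
tables (T1 to T4). 

```latex
\[
\frac{\Delta\,\text{throughput}}{\Delta\,\text{memory}}
  \;\ge\; \frac{136\%}{24\%} \approx 5.7
\]
```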
 
 
 
 
 
 
 
 
 
Table 5.4: Total memory changes with subtrie duplication 
 
 
 
Table 5.5: Memory requirement changes with subtrie duplication 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
CHAPTER 6  
 
 
6. CONCLUSION 
 
 
6.1. SUMMARY 
 
The research hypothesis of this project is that the SAFIL architecture uses 
unnecessary memory because its memory distribution is not balanced, which also affects 
the throughput of the system. In this work, we aim to reduce the memory requirement 
and to increase the throughput of the system. 
 
The first objective was to develop a new mapping algorithm that balances the 
memory distribution so that the overall system requires less memory. While mapping, we 
focused on the node load of the system instead of the bit strings of the corresponding 
nodes: during mapping operations, the subtries followed the PEs with smaller node 
loads. The simulations showed that the new mapping algorithm has an advantage over the 
previous algorithm in memory usage. Table 4.3 in Chapter 4 shows that the memory usage 
is reduced by at least 45%. 
 
The second objective was to scale this architecture up to 3D. We explained why it 
is not possible to scale the system up to 3D with the previous mapping algorithm, and 
we implemented the 3D version of the SAFIL architecture with the new mapping 
algorithm. Although the 3D version of SAFIL has an advantage over the initial SAFIL in 
throughput and memory usage, it has no advantage over the 2D version of SAFIL that 
uses the new mapping algorithm. This is because the new mapping algorithm makes the 
memory distribution highly balanced. Since the error rate is so low even in 2 
dimensions, scaling up to 3 dimensions cannot significantly improve the performance of 
the system. 
 
The last objective of this project was to duplicate the popular subtries. In real 
traffic traces, some subtries are used much more frequently than others. This affects 
the throughput of the system because packets have to wait before entering the system, 
since most of them want to enter from the same entrance point. We therefore developed 
a method that chooses the most used subtries and maps them into the system again from 
different entrance points. This method increases the memory usage, since some subtries 
are duplicated. However, the increase in memory results in a much higher increase in 
throughput, because the subtries to duplicate are chosen by an algorithm. Table 5.3 in 
Chapter 5 shows that this method improved the throughput by at least 136% for the 
tests with real traffic traces. Table 5.5 in Chapter 5 shows that the memory increase 
due to the new method is at most 24%. We can conclude that the performance can be 
increased if the workload is distributed: not only the memory but also the workload 
should be balanced. 
 
To sum up, we have confirmed our hypothesis by achieving our objectives. The 
inefficient memory usage problem of SAFIL has been solved, and the throughput has been 
increased by solving the memory problem. A further increase in throughput has been 
achieved with the duplication method. 
  
 
 
 
 
6.2. FUTURE WORK 
 
For the future development of IP lookup, we can consider some technical 
improvements. First, the shape of the SAFIL structure can be redesigned in order to 
obtain a more homogeneous system; a 3D version is one way to do that. A more 
homogeneous system will distribute the workload more evenly. Second, the duplication 
method can be improved: the parameter selection and the remapping can be studied so 
that better results are obtained. Third, a method can be developed to avoid the gap in 
the new mapping algorithm. Lastly, the new mapping algorithm can be tried with other 
trie construction methods in addition to initial partitioning and zeros/ones skip 
clustering. 
 
  
 
 
 
 
 
 
BIBLIOGRAPHY 
 
 
[1] Aweya, J. (2001). IP router architectures: An overview. International Journal 
of Communication Systems,14(5), 447-475. doi:10.1002/dac.505 
[2] Erdem, O., & Bazlamacci, C. F. (2010). Array Design for Trie-based IP Lookup. 
IEEE Communications Letters,14(8), 773-775. 
doi:10.1109/lcomm.2010.08.100398 
[3] Srinivasan, V., & Varghese, G. (1999). Fast address lookups using controlled 
prefix expansion. ACM Transactions on Computer Systems,17(1), 1-40. 
doi:10.1145/296502.296503 
[4] Wang, G., & Tzeng, N. (2006). TCAM-Based Forwarding Engine with 
Minimum Independent Prefix Set (MIPS) for Fast Updating. 2006 IEEE 
International Conference on Communications. doi:10.1109/icc.2006.254712 
[5] Baboescu, F., Tullsen, D., Rosu, G., & Singh, S. (n.d.). A Tree Based Router 
Search Engine Architecture with Single Port Memories. 32nd International 
Symposium on Computer Architecture (ISCA05). doi:10.1109/isca.2005.7 
[6] Kumar, S., Becchi, M., Crowley, P., & Turner, J. (2006). Camp. Proceedings 
of the 2006 ACM/IEEE Symposium on Architecture for Networking and 
Communications Systems - ANCS 06. doi:10.1145/1185347.1185355 
[7] Jiang, W., Wang, Q., & Prasanna, V. K. (2008). Beyond TCAMs: An SRAM-
Based Parallel Multi-Pipeline Architecture for Terabit IP Lookup. 2008 
Proceedings IEEE INFOCOM - The 27th Conference on Computer 
Communications. doi:10.1109/infocom.2007.241 
[8] Song, H., Kodialam, M., Hao, F., & Lakshman, T. (2009). Scalable IP lookups 
using shape graphs. 2009 17th IEEE International Conference on Network 
Protocols. doi:10.1109/icnp.2009.5339697 
 
 
 
 
50 
[9] Erdem, O., & Bazlamacci, C. F. (2012). High-performance IP Lookup Engine 
with Compact Clustered Trie Search. The Computer Journal,55(12), 1447-
1466. doi:10.1093/comjnl/bxs008 
[10] http://www.caida.org/data/passive/passive_2015_dataset.xml 
[11] http://www.routeviews.org/routeviews/ 
 
 
