Global Memory Mapping for FPGA-Based Reconfigurable Systems
Iyad Ouaiss and Ranga Vemuri
Digital Design Environments Lab, University of Cincinnati
Cincinnati, OH 45221-0030, USA
{iouaiss, ranga}@ececs.uc.edu
Abstract
Synthesizing designs for FPGA-based reconfigurable
systems involves the task of mapping variables and data
structures of the application onto RAMs of the reconfig-
urable board. The variety in types and performance of on-
board and on-chip RAMs, their proximity to the processing
units, and the interconnection scheme of the reconfigurable
system, all contribute to an intricate memory mapping prob-
lem. An intelligent memory assignment minimizes the to-
tal latency of the design and the interconnection require-
ments due to memory accesses. A complete Integer Linear
Programming (ILP) formulation of the problem results in
an optimized memory mapping; however, the formulation
is complex and takes a very long time to produce a solu-
tion. In order to efficiently solve the problem, the concept of
global/detailed memory mapping is introduced in this pa-
per. An ILP formulation of the global mapping process
is described. This formulation is simpler and faster than
the complete formulation, and it leaves the task of detailed
mapping to a post-ILP tool that does not affect the optimal-
ity of the memory assignment. As a result, larger designs
can be handled at a faster rate and more constraints can be
introduced to the formulation.
1 Introduction
Whether an application is to be implemented as an ASIC
or on an FPGA, variables in the design have to be assigned
onto physical memory banks and operations onto hardware
logic. In both technologies, memory mapping is the task of
assigning each data structure in the design to one or more
physical RAMs. Mapping schemes vary widely in complexity:
a simple scheme might not consider splitting a data
structure across two physical banks or might cater solely
to single-port memories. A more complex scheme might
dynamically partition a data structure onto several possibly
different memory banks, might support overlapping of
data structures onto the same physical memory space, or
might take into account the application’s access patterns to
the data structures while performing the assignment.

This work is supported in part by the US Air Force, Wright
Laboratory, WPAFB, under contract number F33615-97-C-1043.
With signal and image processing applications, mem-
ory mapping becomes a crucial step in the synthesis pro-
cess: The performance of these data-intensive applications
is heavily affected by the quality of the memory assignment.
In image and speech processing applications, physical RAMs
can easily occupy more than half the ASIC implementation.
Thus, if data structures are not cleverly mapped, congestion
in routing and memory access degradation can occur.
For ASICs, memory mapping consists of selecting memory
components from a library, selecting where the components
are placed, and selecting the way in which they are
connected to the hardware logic. In the case of FPGAs, and
of reconfigurable computing (RC) systems in general, the
mapping instead consists of assigning the data structures
to a fixed hardware platform. The ASIC scenario assumes
that the memory banks are picked to match the data
structures of the application, whereas in the RC scenario
the data structures are manipulated to fit on the
pre-existing physical banks.
RC boards with modern FPGAs not only have off-chip
physical banks but also offer a large number of on-chip
memories. Hence, due to the abundance of physical mem-
ory banks as well as the amount and importance of data
structures in signal processing algorithms, it becomes dif-
ficult to manually perform memory mapping. The on-chip
memory banks of Xilinx Virtex devices [18], called Block-
RAMs, vary from 8 BlockRAMs for the XCV-50 device
up to 208 BlockRAMs for the XCV-3200E device. On-
chip memory banks of Altera FLEX 10K devices [2], called
Embedded Array Blocks (EABs), vary from 9 EABs for the
EPF10K70 device up to 20 EABs for the EPF10K250A de-
vice. On-chip memory banks of Altera APEX E devices
[1], called Embedded System Blocks (ESBs), vary from 12
ESBs for the EP20K30E device up to 216 ESBs for the
EP20K1500E device. Table 1 summarizes the on-chip RAMs
available in today’s FPGAs.

Device           RAM Name               RAMs        Size      Configurations
                                        (# banks)   (# bits)
Xilinx Virtex    BlockRAM               8 to 208    4096      4096x1, 2048x2, 1024x4, 512x8, 256x16
Altera FLEX 10K  Embedded Array Block   9 to 20     2048      2048x1, 1024x2, 512x4, 256x8, 128x16
Altera APEX E    Embedded System Block  12 to 216   2048      2048x1, 1024x2, 512x4, 256x8, 128x16

Table 1. FPGA On-chip RAMs

0-7695-0990-8/01/$10.00 (C) 2001 IEEE
In addition to the number of available RAMs on an FPGA,
the number of memory ports of each on-chip memory bank
could be greater than one. Both Xilinx and Altera devices
offer dual-ported memories, each port accessing the same
physical space. Finally, the depth/width ratio of each mem-
ory bank could be variable. Both Xilinx and Altera devices
have five configurations depicted in Table 1.
Little work has been done to automatically assign data
structures to complex RC systems; furthermore, features
such as multiple configurations per memory bank have not
yet been incorporated in the synthesis process. This paper
proposes a modification to memory mapping: the process is
divided into global mapping followed by detailed mapping.
An ILP formulation is used to optimize the solution by
performing global mapping. Since detailed mapping does not
affect the quality of the assignment, it can be performed
after global mapping, thereby reducing the complexity of the
ILP formulation.
The rest of this paper is organized as follows: Section 2
discusses previous work performed in memory mapping.
Section 3 describes the inputs to the environment in which
memory mapping is performed. Section 4 depicts the Integer
Linear Programming approach for memory mapping; it
first briefly states the complete ILP formulation and then
introduces the global versus detailed memory mapping paradigm.
Section 5 shows some results obtained. Finally, Section 6
concludes the paper and presents future work.
2 Previous Work
The majority of the memory mapping studies focus on
the ASIC implementation. The tools pick a set of physical
memory modules from a library of available banks and se-
lect the interconnection structure to connect the processing
units to the memory banks. Very few studies target recon-
figurable boards where memory banks and interconnection
structure are fixed before synthesizing the application.
In ASIC implementations, since the hardware is custom
built, the aim is to minimize the interconnection cost and the
number of required physical memory banks. However, with
on-chip memory and fixed external memory banks and in-
terconnection structures in RC systems, the problem is dif-
ferent. Off-chip interconnections are reduced when using
on-chip memory. Furthermore, given a fixed memory structure
on an RC board, minimizing the number of required
memory banks might not produce the optimal solution; as
long as the mapper does not exceed the physical storage
area, it should be allowed to use as many banks as it sees fit.
Integer linear programming models were used in [8, 11]
to group registers and form multi-port memory modules in
ASIC implementations. Several studies were conducted in
the realm of high-level synthesis where cliques of the design
variables are partitioned to form data segments. Some
researchers performed this task without taking into
consideration the interconnection structure of the hardware [3],
while others considered the cost of interconnection during
variable grouping for multi-port memory banks [16].
In [15], memory mapping for FPGAs with on-chip mem-
ories is addressed; however, only single-ported memory
banks are assumed. The same technique was improved in
[17] so that the mapping caters to recent FPGAs containing
dual-ported on-chip banks. In both works, the focus is on
hardware containing a single type of memory bank (either
single- or dual-ported); off-chip memories that exhibit
different performance numbers are not considered
simultaneously.
Memory access optimizations are targeted in [12], where
the goal is to optimize memory accesses in pipelined de-
signs. Synthesis transformations take advantage of on-chip
RAMs and intelligent schedules are produced to maximize
parallel accesses. Therefore, memory mapping takes place
during synthesis and does not address hierarchical memory
banks.
Given data structures and access constraints to these
structures, [5] finds a legal packing of the logical seg-
ments into the physical segments while minimizing the area.
Again, since the storage area in the RC framework is fixed,
it might not be beneficial to minimize the occupied area.
An analysis of several memory mapping studies is pre-
sented in [13]. The authors compare the techniques based
on the number and the type of logical memories and physi-
cal banks considered simultaneously. In addition, the book
by Catthoor et al. [6] and the book by Panda, Dutt, and
Nicolau [14] provide an excellent source for topics in mem-
ory system optimization, exploration, and management.
In a previous report [9], we implemented a memory map-
ping technique that performs a complete memory assign-
ment in a single step. However, the formulation becomes
quite lengthy and the solution time explodes for large prob-
lems. Instead of this “flat view” solution, this paper di-
vides the logical-to-physical memory mapping process into
two steps: first, global memory mapping assigns each data
structure in the application to one type of physical mem-
ory banks. A bank type refers to a collection of physical
memories that share the same architectural properties and
the same access performances. Second, detailed memory
mapping performs the lower-level assignment; it restruc-
tures the data segments and assigns them to specific banks
of the type that was dictated by global mapping.
Since all banks of the same type share the same per-
formance and architecture specifications, detailed memory
mapping does not affect the overall memory mapping cost.
The optimization goal is thus sought after during global
memory mapping, and the complexity of the global map-
ping is reduced by avoiding detailed mapping. By reducing
the size of the problem, an ILP formulation becomes sim-
pler and the solution is obtained faster. As a result, designs
with several hundred data structures and memory banks are
efficiently handled.
3 Problem Formulation
Several features exist for different types of RC boards.
This section generalizes the approach of memory mapping
by targeting a flexible hierarchical memory structure.
A memory mapper must take as an input both the archi-
tecture of the target RC board as well as a description of
the design to be synthesized and mapped onto the board.
For this paper, it is assumed that the RC board contains
only one processing unit. As part of future enhancement,
this work will be extended to multi-processing units where
logic placement and pin constraints during routing will be
addressed.
3.1 Architecture Description
As introduced in Section 2, the RC board architecture
is described by a collection of memory types. There could
be several instances of each memory type, but all instances
share the same storage and access speed specifications, and
share the same proximity and ease of access from the pro-
cessing unit.
For each memory type, a number of instances tells the
mapper how many instances of each type exist on the board.
The number of ports of a type is one if the memory is a
single-ported memory, two if dual-ported, etc. As shown
in Figure 1, the depth/width ratio of a memory could be
variable; the number of configurations for each type is the
number of possible settings of each port of that type.
[Figure 1. Generic Memory Bank: the design logic reaches a
physical memory bank of type J through ports Port1 to PortN;
the configurations Depth1/Width1 to DepthM/WidthM, a latency
model, and a pin traversal model describe the bank.]

The number of words (depth) and the number of bits per
word (width) of a type are unique numbers if only one
configuration exists. Otherwise, these are equal-length lists
of numbers describing the possible configurations of the
depth/width ratio. Entry i of the depth list together with
entry i of the width list correspond to configuration i. It is
assumed that the capacity of each configuration is a con-
stant.
The access latency for each type of memory is variable.
The read latency is the number of clock cycles required
after performing a read and before getting valid data out of the
memory bank. Similarly, the write latency is the number of
clock cycles required after performing a write operation and
before the data is correctly stored in the memory bank.
Finally, with respect to the physical location of the
memory bank, the number of pins traversed depicts the
proximity of the physical memory bank to the processing unit. If
a bank is on-chip, zero pins are traversed. If an off-chip
bank is directly connected to the processing unit, two pins
are traversed. If an indirect connection exists between the
processing unit and the external memory bank, then additional
pins are traversed. In general, the aim is to map data
structures to physical banks that are as close as possible to
the processing unit; the farther away they are, the larger the
impact on the overall memory access performance.
A generic physical bank is shown in Figure 1. The la-
tency model captures the number of read and write clock
delays and the pin traversal model captures the number of
pins traversed between the processing unit and the memory
bank. In the figure, bank type J is an N-ported memory with
M different configurations: depth1/width1, depth2/width2,
..., depthM/widthM.
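The architecture description above can be captured in a small record per memory type. The following is a minimal sketch, not the authors' input format: the class name `BankType`, its field names, and the latency values in the example are all hypothetical. It only encodes the properties listed in this section and checks the stated assumption that every configuration has the same capacity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BankType:
    """One memory type on the RC board (hypothetical container, not from the paper)."""
    name: str
    instances: int        # number of banks of this type on the board
    ports: int            # 1 for single-ported, 2 for dual-ported, ...
    depths: List[int]     # words per configuration (entry i pairs with widths[i])
    widths: List[int]     # bits per word per configuration
    read_latency: int     # clock cycles
    write_latency: int    # clock cycles
    pins_traversed: int   # 0 on-chip, 2 for a directly connected off-chip bank

    def __post_init__(self) -> None:
        if not self.depths or len(self.depths) != len(self.widths):
            raise ValueError("depth and width lists must be non-empty and equal length")
        capacities = {d * w for d, w in zip(self.depths, self.widths)}
        if len(capacities) != 1:
            raise ValueError("every configuration must have the same capacity")

    @property
    def capacity_bits(self) -> int:
        # All configurations share one capacity, so the first entry suffices.
        return self.depths[0] * self.widths[0]


# Virtex BlockRAM shape from Table 1; instance count is the XCV-50 figure,
# and the latency values are made-up placeholders.
block_ram = BankType("BlockRAM", instances=8, ports=2,
                     depths=[4096, 2048, 1024, 512, 256],
                     widths=[1, 2, 4, 8, 16],
                     read_latency=1, write_latency=1, pins_traversed=0)
print(block_ram.capacity_bits)  # 4096
```

Keeping the depth and width lists parallel mirrors the convention that entry i of one list pairs with entry i of the other.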
3.2 Task Graph Description
On the design side, a description of the data structures is
required. Since this work focuses on placing the data struc-
tures on the physical banks, it is assumed that the structures
are already formed.
For each data segment in the design, the number of words
(depth) in the segment and the number of bits per word
(width) are required. A footprint analysis of the memory
accesses could tremendously help in guiding the mapping
process: e.g. data segments that are extensively accessed
should be assigned to faster and closer physical banks.
3.3 Conflict Description
During synthesis of a design, scheduling determines the
life times [7, 4] of the variables and data structures. This life
cycle analysis could further improve the memory mapping
since segments that can overlap could be placed in the same
storage area, thus decreasing the total storage requirement.
For this purpose, the mapper needs to know which data
segments have overlapping life cycles. A set of conflicting
pairs captures this requirement: pair (L1, L2) means that
data segment L1 cannot share storage space with segment L2.
4 ILP Formulation
For the formulation presented in this section, we assume
the following notation. There are M data structures:
$DS = \{DS_1, DS_2, \ldots, DS_M\}$ to be mapped onto N different
types of physical memory banks:
$PB = \{PB_1, PB_2, \ldots, PB_N\}$.
There could be multiple instances of each type of mem-
ory bank.
For each logical data structure d, we have:

$D_d$   Number of words in segment d.
$W_d$   Number of bits per word in segment d.
For each type of physical memory bank t, we have:

$I_t$    Number of banks of type t.
$P_t$    Number of ports in a bank of type t.
$C_t$    Number of depth/width configurations in a bank of type t.
$D_t$    Array of number of words in a bank of type t.
$W_t$    Array of number of bits per word in a bank of type t.
$RL_t$   Read latency in number of clock cycles.
$WL_t$   Write latency in number of clock cycles.
$T_t$    Number of pins traversed from the processing unit to a
         bank of type t.

where the depth/width ratio variables are:
$D_t = \{d_1, d_2, \ldots, d_{C_t}\}$ and $W_t = \{w_1, w_2, \ldots, w_{C_t}\}$.
In addition, there are Q conflict pairs in the design, where
each associates two logical structures:
$(DS_x, DS_y)$, where $x \neq y$.
Finally, the remaining notations pertain to the 0-1 variables
used in the model. $Z_{dt}$ associates a data structure to a
memory type:

$$Z_{dt} = \begin{cases} 1 & \text{if data structure } d \text{ is assigned to some instance of bank type } t, \\ 0 & \text{otherwise.} \end{cases}$$

$Z_{dt}$ is used to force an oversized data structure to be split
across banks of the same type. Similarly, $X_{dtip}$ associates a
data structure to a specific physical bank:

$$X_{dtip} = \begin{cases} 1 & \text{if data structure } d \text{ is assigned to port } p \text{ of instance } i \text{ of bank type } t, \\ 0 & \text{otherwise.} \end{cases}$$

And, only for multi-configuration physical banks (i.e.
$C_t > 1$), $Y_{tipc}$ sets a specific configuration to a port of a
memory bank:

$$Y_{tipc} = \begin{cases} 1 & \text{if configuration } c \text{ is selected for port } p \text{ of instance } i \text{ of bank type } t, \\ 0 & \text{otherwise.} \end{cases}$$
4.1 Global Memory Mapping
Global mapping only considers the task of assigning a
data structure to exactly one type of memory bank. It does
not deal with the assignment of the data structure to spe-
cific instances and ports of the type. However, global map-
ping ensures a successful detailed mapping by taking into
account the architecture specification while avoiding non-
optimizing factors in the formulation.
While a complete memory mapper makes use of all three
variables $X_{dtip}$, $Z_{dt}$, and $Y_{tipc}$, a global memory mapper
requires only $Z_{dt}$. $Z_{dt}$ assigns a data structure
to a memory type, whereas $X_{dtip}$ and $Y_{tipc}$ assign data
structures and configurations to specific bank instances.
The execution time savings obtained by using a global
mapper could be lost if the detailed mapper fails. If this oc-
curs, the global and detailed mappers need to execute multi-
ple times until a solution is found. Thus, it is very important
to ensure that the global mapper produces an assignment
that can be successfully detailed mapped. In an ILP for-
mulation, this translates to having constraints in the global
mapper that take into account the number of instances of
each type of memory banks, the number of ports of each in-
stance, and the available width/depth configurations of the
type.
4.1.1 ILP Pre-processing
For each design being mapped, the global mapper initially
pre-processes some information in order to produce an ILP
formulation of the problem. This formulation will result in
a fast ILP solution that will successfully go through detailed
mapping.

[Figure 2. Space and Ports Allocation Example: a grid of bank
instances occupied by one data structure, divided into the
regions $FP_{dt}$, $WP_{dt}$, $DP_{dt}$, and $WDP_{dt}$; each instance is
labeled with its selected configuration (16x8 or 128x1) and,
in parentheses, its number of unused bits.]
Three main parameters need to be computed to allow a
simple yet powerful constraint formulation: $CP_{dt}$, the total
number of consumed ports of memory type t if data structure
d is assigned to it; $CW_{dt}$, the “ceiling” value of the
width of data structure d if assigned to bank type t; and
$CD_{dt}$, the “ceiling” value of the depth of data structure
d if assigned to bank type t.
First, the total number of consumed ports CPdt depends
on the size of data structure d with respect to the size and
number of ports of bank type t. There are four components
to CPdt and they are illustrated by the following example.
A 55x17 data structure is to be mapped onto one type
of memory bank that has 3 ports, and four ratio configura-
tions: 128x1, 64x2, 32x4, 16x8. Since the data structure
requires more than one instance, Figure 2 shows a mapping
where the assigned instances can be visualized as a rectan-
gular area. The width, 17, of the data structure will be di-
vided into: 8, 8, and 1; and the depth will thus have to be 16
to match the 16x8 configuration. Hence, the upper left in-
stances in the rectangle are fully utilized with configuration
16x8 selected. The upper right instances, that form a sin-
gle column, are partially utilized with configuration 128x1
selected since there remained one bit from the width of the
structure. The lower left instances, that form a single row,
are partially utilized with configuration 16x8 selected. Fi-
nally, the lower right instance, that is a single instance, is
partially utilized with configuration 128x1 selected. Note
also that, since the memory type has 3 ports, all ports in
the upper left instances are consumed, and some ports in all
other instances are consumed. The “O” next to a port signi-
fies that the port is used by this data structure. The “X” next
to a port denotes that the port is wasted. And, the check
mark next to a port indicates that the port is available for
other data structures. Finally, the number between paren-
theses next to available ports indicates the total number of
bits that are still unused in the instances.
Thus, following the scheme of Figure 2:

$$\forall d \in DS,\ \forall t \in PB:\quad CP_{dt} = FP_{dt} + WP_{dt} + DP_{dt} + WDP_{dt}$$

where

$$FP_{dt} = \left\lfloor \frac{D_d}{D_{t\alpha}} \right\rfloor \times \left\lfloor \frac{W_d}{W_{t\alpha}} \right\rfloor \times P_t$$

$\alpha$ refers to the configuration with the smallest width such
that $W_{t\alpha}$ is greater than or equal to $W_d$. If $W_d$ is larger than
all configuration widths, then $\alpha$ is the configuration with
the largest width. For multi-configuration banks, the best
configuration yields the smallest numbers of instances used
while trying to match the width of the bank with the width
of the data structure.
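If $W_d \bmod W_{t\alpha} = 0$ then $WP_{dt} = 0$; else:

$$WP_{dt} = \left\lfloor \frac{D_d}{D_{t\alpha}} \right\rfloor \times \mathrm{consumed\_ports}(D_{t\alpha},\ D_{t\beta},\ P_t)$$

where $\beta$ refers to the configuration with the smallest width
such that:

$$W_{t\beta} \geq W_d \bmod W_{t\alpha}$$

consumed_ports() is defined in Figure 3. And:

$$DP_{dt} = \left\lfloor \frac{W_d}{W_{t\alpha}} \right\rfloor \times \mathrm{consumed\_ports}(D_d \bmod D_{t\alpha},\ D_{t\alpha},\ P_t)$$

Finally, if $W_d \bmod W_{t\alpha} = 0$ then $WDP_{dt} = 0$; else:

$$WDP_{dt} = \mathrm{consumed\_ports}(D_d \bmod D_{t\alpha},\ D_{t\beta},\ P_t)$$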
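The second parameter, $CW_{dt}$, indicates the total width
that is consumed by data structure d if assigned to bank type
t. After finding configuration $\alpha$ and $\beta$ as described above:

$$CW_{dt} = \left\lfloor \frac{W_d}{W_{t\alpha}} \right\rfloor \times W_{t\alpha} + W_{t\beta}$$

Similarly, the third parameter, $CD_{dt}$, indicates the total
depth that is consumed by data structure d if assigned to
bank type t:

$$CD_{dt} = \left\lfloor \frac{D_d}{D_{t\alpha}} \right\rfloor \times D_{t\alpha} + \left\lceil D_d \bmod D_{t\alpha} \right\rceil_{pow(2)}$$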
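As a check on these definitions, the 55x17 example of Figure 2 can be worked through by hand. The numbers below are a sketch under two assumptions that the text does not state outright: the quotients in the formulas are floor divisions, and the power-of-two rounding rounds up, so that the 7 leftover words (55 mod 16) become 8. With configuration α = 16x8, configuration β = 128x1, and Pt = 3:

```latex
\begin{aligned}
FP_{dt}  &= \lfloor 55/16 \rfloor \times \lfloor 17/8 \rfloor \times 3 = 3 \times 2 \times 3 = 18 \\
WP_{dt}  &= \lfloor 55/16 \rfloor \times \lceil (16/128) \times 3 \rceil = 3 \times 1 = 3 \\
DP_{dt}  &= \lfloor 17/8 \rfloor \times \lceil (8/16) \times 3 \rceil = 2 \times 2 = 4 \\
WDP_{dt} &= \lceil (8/128) \times 3 \rceil = 1 \\
CP_{dt}  &= 18 + 3 + 4 + 1 = 26 \\
CW_{dt}  &= \lfloor 17/8 \rfloor \times 8 + 1 = 17 \\
CD_{dt}  &= \lfloor 55/16 \rfloor \times 16 + 8 = 56
\end{aligned}
```

Under these assumptions, the 18 ports of the FP term correspond to the six fully utilized three-port instances in the upper left of Figure 2, and the structure reserves a 56x17 region (952 bits) for its 935 bits of data.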
For data structures to co-exist on the same instance
of a bank type (on different ports of the bank),
consumed_ports() is used to compute the fractional number
of ports consumed by a data structure. If the entire data
structure fits on a single instance, then consumed_ports()
actually is the total number of ports consumed by the data
structure; however, when multiple instances are required,
consumed_ports() returns the number of ports consumed
on each used instance.

function consumed_ports(Dd, Dt, Pt)
begin
    depth = round(Dd, pow(2))
    fraction = depth / Dt
    EP = ceil(fraction * Pt)
    return EP
end

Figure 3. Fractional Port Consumption
Thus, for an n-ported memory, the algorithm in Figure 3
computes consumed_ports(). The main purpose of
this algorithm is to make sure that once data structures are
mapped onto physical banks, no adders or other logic would
be required to perform memory accesses. To do so, each
assigned fraction in an instance is rounded to the closest
power-of-two depth that corresponds to the configuration
with the largest width. In addition, the port assignment fol-
lows the order of decreasing fraction sizes. This ensures
that several fractions can be assigned to different ports of
an instance without the need of extra logic for base address
generation. It also ensures that the memory space for each
fraction is mutually exclusive, thus avoiding unwanted con-
flicts.
Note that consumed_ports() in Figure 3 is optimal for
$P_t = 2$. There is a waste of ports when $P_t > 2$; but since the
majority of on-board and on-chip memory banks are either
single or dual ported, this problem is not very pronounced
currently. Improvement to this algorithm is part of future
work.
Hence, a multi-ported memory bank can only be divided
in a fixed number of ways. For instance, the general space
allocation for a 3-port memory with 16-word depth is shown
in Table 2. The algorithm in Figure 3 rejects the (8, 8, 0)
configuration since it estimates that 8 words require two
ports each thus requiring a total of 4 ports. This over-
estimation does not occur when a bank type has only two
ports.
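The port estimate of Figure 3 and the (8, 8, 0) case from Table 2 can be reproduced with a few lines of Python. This is a sketch rather than the authors' implementation: it assumes that the power-of-two rounding always rounds up, and the helper names `round_up_pow2` and `fits` are hypothetical.

```python
from typing import Sequence


def round_up_pow2(n: int) -> int:
    """Smallest power of two >= n; 0 stays 0. Rounding up is an assumption."""
    return 0 if n <= 0 else 1 << (n - 1).bit_length()


def consumed_ports(words: int, bank_depth: int, ports: int) -> int:
    """Ports consumed on one instance by a fraction of `words` words."""
    depth = round_up_pow2(words)
    # Integer form of ceil((depth / bank_depth) * ports), avoiding float error.
    return (depth * ports + bank_depth - 1) // bank_depth


def fits(split: Sequence[int], bank_depth: int, ports: int) -> bool:
    """True if the estimated ports for every non-empty fraction fit on one instance."""
    needed = sum(consumed_ports(w, bank_depth, ports) for w in split if w > 0)
    return needed <= ports


print(consumed_ports(8, 16, 3))   # 2: each 8-word fraction is charged two ports
print(fits((8, 8, 0), 16, 3))     # False: 2 + 2 = 4 ports on a 3-port bank
print(fits((8, 8), 16, 2))        # True: 1 + 1 = 2 ports on a 2-port bank
```

The integer ceiling is a design choice of this sketch; it gives the same result as the ceiling of the fraction times the port count without relying on floating-point arithmetic.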
4.1.2 Constraint Formulation
Next, the ILP formulation is constructed based on the
following constraints:

• Uniqueness constraints: Each data structure should
be mapped to exactly one type of physical bank:

$$\forall d \in DS:\quad \sum_{t \in PB} Z_{dt} = 1$$
3-port 16-word bank
Port 1       Port 2       Port 3
(# words)    (# words)    (# words)
16           0            0
8            8            0
8            4            4,2,1,0
8            2            2,1,0
8            1            1,0
8            0            0
4            4            4,2,1,0
4            2            2,1,0
4            1            1,0
4            0            0
2            2            2,1,0
2            1            1,0
2            0            0
1            1            1,0
1            0            0
0            0            0

Table 2. Example on Allocation Options

• Port constraints: Each memory type should have
enough ports for all the data structures assigned to it:

$$\forall t \in PB:\quad \sum_{d \in DS} Z_{dt} \times CP_{dt} \leq P_t \times I_t$$

From the pre-processing step, the number of ports that
each data structure would consume from each type of
memory is computed. The sum of all consumed ports
of a type should be less than or equal to the total
number of available ports of the type.

• Capacity constraints: Each bank type must have
enough space to contain all data structures assigned to
it.

$$\forall t \in PB:\quad \sum_{d \in DS} Z_{dt} \times (CW_{dt} \times CD_{dt}) \leq I_t \times W_t[1] \times D_t[1]$$

In the event where the life-cycles of different data
structures do not conflict, the capacity constraint is
slightly modified to allow overlapping in the memory
space.
Note: The uniqueness constraint ignores the specifica-
tions of the memory banks; whereas the port and capacity
constraints cater to the specific features of each type. It is
these latter constraints in addition to some parameters com-
puted in the pre-processing step, that ensure successful de-
tailed mapping.
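The three constraints can be read as a feasibility test on a candidate global assignment. The sketch below checks them directly in Python instead of handing them to an ILP solver; the names `TypeInfo`, `Pre`, and `is_feasible` are hypothetical, the values for structure A are the 55x17 example assuming floor division and upward power-of-two rounding, and the values for structure B are made up for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class TypeInfo(NamedTuple):
    instances: int  # I_t
    ports: int      # P_t
    bank_bits: int  # W_t[1] * D_t[1], the capacity of one instance


class Pre(NamedTuple):
    cp: int  # CP_dt, consumed ports
    cw: int  # CW_dt, consumed width
    cd: int  # CD_dt, consumed depth


def is_feasible(assign: Dict[str, str],
                types: Dict[str, TypeInfo],
                pre: Dict[Tuple[str, str], Pre]) -> bool:
    structures = {d for d, _ in pre}
    # Uniqueness: the dict maps every structure to exactly one known type.
    if set(assign) != structures or any(t not in types for t in assign.values()):
        return False
    for t, info in types.items():
        mapped = [d for d, chosen in assign.items() if chosen == t]
        ports_used = sum(pre[(d, t)].cp for d in mapped)
        bits_used = sum(pre[(d, t)].cw * pre[(d, t)].cd for d in mapped)
        if ports_used > info.ports * info.instances:     # port constraint
            return False
        if bits_used > info.instances * info.bank_bits:  # capacity constraint
            return False
    return True


# A uses the 55x17 example (CP=26, CW=17, CD=56); B's numbers are made up.
pre = {("A", "onchip"): Pre(26, 17, 56), ("B", "onchip"): Pre(2, 8, 8)}
assign = {"A": "onchip", "B": "onchip"}
print(is_feasible(assign, {"onchip": TypeInfo(12, 3, 128)}, pre))  # True
print(is_feasible(assign, {"onchip": TypeInfo(8, 3, 128)}, pre))   # False
```

With 8 instances the type offers only 24 ports, so the 28 ports requested by A and B violate the port constraint even though the capacity constraint alone would not reject the assignment.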
4.1.3 Objective Formulation
The objective of the ILP model is to optimize the
performance and minimize the interconnection cost of the
memory assignment. The cost function takes the form:

$$\text{minimize}\ [Cost_1 \times \alpha_1 + Cost_2 \times \alpha_2 + \cdots + Cost_n \times \alpha_n]$$

where $\alpha_i$ is a weight coefficient used to normalize $Cost_i$ with
respect to all other cost components.
Three cost components are depicted below.

• Latency cost: Assuming the number of reads is equal
to the number of writes for every data structure:

$$\sum_{d \in DS} \sum_{t \in PB} Z_{dt} \times D_d \times [RL_t + WL_t]$$

• Pin delay cost: Assuming the number of pins traversed
from the processing unit to reach the memory
bank is inversely proportional to the clock speed:

$$\sum_{d \in DS} \sum_{t \in PB} Z_{dt} \times D_d \times T_t$$

• Pin I/O cost: The larger the depth and width of a data
structure the more pins it will need in the event of
off-chip physical banks:

$$\sum_{d \in DS} \sum_{t \in PB} Z_{dt} \times (\lceil \log_2(CD_{dt}) \rceil + CW_{dt}) \times T_t$$
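To make the three cost components concrete, the sketch below evaluates them for a given assignment and picks the cheapest one by exhaustive enumeration. It is an illustration only: it ignores the port and capacity constraints, it uses brute force rather than an ILP solver, and the latency, pin, and pre-processed values in the example are made-up placeholders.

```python
from itertools import product
from typing import Dict, Tuple


def ceil_log2(n: int) -> int:
    """ceil(log2(n)) for n >= 1, computed with integers only."""
    return (n - 1).bit_length()


def cost(assign: Dict[str, str],
         depth: Dict[str, int],
         types: Dict[str, Tuple[int, int, int]],
         pre: Dict[Tuple[str, str], Tuple[int, int]],
         weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    latency = pin_delay = pin_io = 0
    for d, t in assign.items():
        rl, wl, pins = types[t]   # RL_t, WL_t, T_t
        cw, cd = pre[(d, t)]      # CW_dt, CD_dt
        latency += depth[d] * (rl + wl)
        pin_delay += depth[d] * pins
        pin_io += (ceil_log2(cd) + cw) * pins
    a1, a2, a3 = weights
    return a1 * latency + a2 * pin_delay + a3 * pin_io


depth = {"A": 55}                                     # D_d
types = {"onchip": (1, 1, 0), "offchip": (2, 2, 2)}   # hypothetical RL, WL, T
pre = {("A", "onchip"): (17, 56), ("A", "offchip"): (17, 64)}

# Enumerate every structure-to-type assignment (feasible only for tiny inputs).
candidates = [dict(zip(depth, choice)) for choice in product(types, repeat=len(depth))]
best = min(candidates, key=lambda a: cost(a, depth, types, pre))
print(best, cost(best, depth, types, pre))  # {'A': 'onchip'} 110.0
```

Because the pin delay and pin I/O terms are both multiplied by the number of pins traversed, they vanish for on-chip banks, which is why the on-chip assignment is cheapest here.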
4.2 Detailed Memory Mapping
Once global mapping is complete, the task of detailed
mapping is to consider one bank type at a time, together
with all data structures assigned to it. It might re-shape the
data structures in case they are larger than a single instance of the
bank type, it might assign configurations for each port of
each bank instance of the type, and it will assign specific
instances for each data structure segment. Thus, the en-
tire problem is serialized into one main formulation (global
mapping) followed by a collection of formulations (detailed
mapping on each bank type).
Since global memory mapping ensures a successful de-
tailed assignment, the task of the detailed mapper is sim-
plified. It cannot further optimize the assignment based on
the optimization criteria established in global mapping. In-
stead, it can aim at optimizing the detailed assignment based
on different optimization factors.
An ILP-based formulation for the detailed memory mapper
was developed. For brevity, the mathematical formulation
is not reproduced in this paper. For every type of memory
bank, an ILP problem is formed and solved. The aim
is to assign data structures to specific ports of specific
instances of the bank, fragmenting a data structure where
necessary so that it fits on several instances. Optimization
factors include reducing on-chip interconnection congestion
and reducing data structure fragmentation.
5 Results
The ILP model presented in the previous section was
executed for designs of different sizes. Also, a complete
memory mapper [9] was executed on the same designs. CPLEX,
a commercial linear programming solver [10], was used.
Both the number and types of logical segments as well as
physical banks were varied. Table 3 shows the execution
time for both the complete formulation approach and the
global/detailed formulation approach for designs of
various sizes. The platform was a SUN Ultra-30 (248MHz
with 128MB RAM). For logical memories, the number of
segments represents the main complexity parameter in the ILP
formulation. Similarly, for physical memories, the three
complexity parameters are: the total number of physical
banks, the total number of ports summed over all instances
of all bank types, and the total number of possible
configuration settings summed over all multi-configuration
ports of all bank types. The execution time is given in seconds.

                  Physical Banks                   Complete      Global
Data Structures   Total     Total     Total        Approach      Approach
#segments         #banks    #ports    #configs     Execution     Execution
                                                   Time (sec)    Time (sec)
22                13        25        50           8.1           7.8
32                23        45        100          29.4          25.3
32                45        77        150          99.3          50.7
42                45        77        150          130.4         59.2
32                65        105       150          172.7         105.1
62                65        105       150          411.0         140.4
32                180       265       375          518.3         216.4
62                180       265       375          1225.0        309.0
132               180       265       375          2989.0        489.0

Table 3. ILP Execution Times
It is clear that for large designs, the complete approach
becomes impractical compared to the global/detailed
approach: for the largest design in Table 3, it takes 2989.0
seconds versus 489.0 seconds. Note that the execution times
shown for the global/detailed formulation include all
pre-processing steps.
It can be seen that for relatively small designs, the differ-
ence in the two approaches decreases. This is because for
the global/detailed technique, the setup time required for
pre-processing and for ILP-processing becomes the domi-
nating factor.
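The trend described above can be quantified from Table 3 by dividing the complete-approach time by the global/detailed time for each design. The short script below is only arithmetic on the published numbers; the variable names are hypothetical.

```python
# (segments, banks, complete time in s, global/detailed time in s), from Table 3.
rows = [
    (22, 13, 8.1, 7.8),
    (32, 23, 29.4, 25.3),
    (32, 45, 99.3, 50.7),
    (42, 45, 130.4, 59.2),
    (32, 65, 172.7, 105.1),
    (62, 65, 411.0, 140.4),
    (32, 180, 518.3, 216.4),
    (62, 180, 1225.0, 309.0),
    (132, 180, 2989.0, 489.0),
]

# Ratio above 1 means the global/detailed approach was faster.
speedups = [complete / glob for _, _, complete, glob in rows]

for (segments, banks, _, _), s in zip(rows, speedups):
    print(f"{segments:4d} segments, {banks:4d} banks: {s:.2f}x")
```

The ratio grows from about 1.04 for the smallest design to about 6.11 for the largest, which matches the observation that the setup overhead dominates for small designs.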
The plot in Figure 4 gives a visual rendering of the re-
sults. The X-axis represents the different design points that
are ordered in the increasing size of the problem (corre-
sponding to the increasing row number in Table 3). For the
sake of clarity, a line is plotted to connect the data points;
it does not represent the performance of the algorithm be-
tween the test cases.
6 Ongoing Work
The global versus detailed memory mapping paradigm,
introduced in this work, eased the ILP formulation while
conserving the quality of the resulting assignment. Large
designs can be mapped faster and more efficiently than
in the case of complete ILP formulation. In addition, the
pre-processing of the data simplified the complexity of the
global mapping formulation and ensured a successful
detailed mapping.

[Figure 4. Complete versus Global/Detailed Execution Times:
execution time (sec), from 0 to 3000, plotted against design
points 0 to 9 for the Complete Approach and the
Global/Detailed Approach.]
As part of ongoing and future work, the following issues
are being considered. First, the consumed_ports()
algorithm needs to be improved for memory banks with more
than two ports. Second, in the case of a single processing
unit, all design logic is mapped onto one hardware area, and
all logic areas are assumed equidistant from each physical
bank. The model needs to be enhanced to support multi-
ple processing units. Finally, arbitration is not taken into
consideration in this paper; in other words, two logical seg-
ments will not be mapped onto the same port. In the event
of RAM limitation, the model could allow data structures
to overlap at the price of adding conflict resolution to the
objective function.
References
[1] Altera Corporation. “APEX 20K Programmable Logic De-
vice Family Data Sheet”, March 2000.
[2] Altera Corporation. “FLEX 10K Embedded Programmable
Logic Family Data Sheet”, May 2000.
[3] C. J. Tseng and D. Siewiorek. “Automated Synthesis of
Data Paths in Digital Systems”. In IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems,
volume 5, pages 379–395, July 1986.
[4] D. D. Gajski, N. D. Dutt, A. C. Wu, and S. Y. Lin. “High-
Level Synthesis, Introduction to Chip and System Design”.
Kluwer Academic Publishers, 1992.
[5] D. Karchmer and J. Rose. “Definition and Solution of
the Memory Packing Problem for Field-Programmable Sys-
tems”. In Proceedings of International Conference on Com-
puter Aided Design, pages 20–26. ACM Press, November
1994.
[6] F. Catthoor, et al. “Custom Memory Management Method-
ology”. Kluwer, 1998.
[7] G. De Micheli. “Synthesis and Optimization of Digital Cir-
cuits”. McGraw-Hill, 1994.
[8] I. Ahmad and C. Y. Chen. “Post-Process for Data Path
Synthesis”. In Proceedings of International Conference on
Computer Aided Design, pages 276–279. ACM Press, 1991.
[9] I. Ouaiss and R. Vemuri. “Hierarchical Memory Mapping
During Synthesis in FPGA-Based Reconfigurable Comput-
ers”. In Proceedings of Design Automation and Test in
Europe (to appear). IEEE Computer Society Press, March
2001.
[10] ILOG Incorporation. “Using the CPLEX Callable Library”.
http://www.cplex.com.
[11] M. Balakrishnan, et al. “Allocation of Multiport Mem-
ories in Data Path Synthesis”. In IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems,
volume 7, pages 536–540, April 1988.
[12] M. Weinhardt and W. Luk. “Memory Access Optimization
and RAM Inference for Pipeline Vectorization”. In Pro-
ceedings of International Workshop on Field-Programmable
Logic and Applications, pages 61–70. Springer, September
1999.
[13] P. Jha and N. Dutt. “High-Level Library Mapping for Mem-
ories”. In ACM Transactions on Design Automation of Elec-
tronic Systems, pages 566–603. ACM Press, July 2000.
[14] P. R. Panda, N. Dutt, and A. Nicolau. “Memory Issues In
Embedded Systems-On-Chip”. Kluwer, 1999.
[15] S. Wilton. “Architectures and Algorithms for Field-
Programmable Gate Arrays with Embedded Memory”. PhD
thesis, University of Toronto, 1997.
[16] T. Kim and C. L. Liu. “Utilization of Multiport Memories
in Data Path Synthesis”. In Proceedings of the 30th Design
Automation Conference, pages 298–302. ACM Press, June
1993.
[17] W. Ho and S. Wilton. “Logical-to-Physical Memory Map-
ping for FPGAs with Dual-Port Embedded Arrays”. In Pro-
ceedings of International Workshop on Field-Programmable
Logic and Applications, pages 111–123. Springer, Septem-
ber 1999.
[18] Xilinx, Inc. “Virtex 2.5V Field Programmable Gate Arrays”,
September 2000.
