Computation Acceleration on SGI RASC: FPGA Based Reconfigurable Computing Hardware by Ernest Jamro et al.
Ernest Jamro∗, Marcin Janiszewski∗∗,
Krzysztof Machaczek∗∗, Pawel Russek∗, Kazimierz Wiatr∗,
Maciej Wielgosz∗
COMPUTATION ACCELERATION ON SGI RASC:
FPGA BASED RECONFIGURABLE COMPUTING
HARDWARE
This paper presents a novel method of computation using FPGA technology. In several cases
the method provides a calculation speedup with respect to General Purpose Processors (GPPs).
The main concept is to design the computing hardware architecture so that it fits the
algorithm dataflow and makes the best use of well-known computing techniques such as
pipelining and parallelism. Configurable hardware is used as the implementation platform
for the custom-designed hardware. The paper presents implementation results for algorithms
used in areas such as cryptography, data analysis and scientific computation. Other
promising application areas, for instance bioinformatics, are also mentioned. The
algorithms were designed, tested and implemented on the SGI RASC platform. The RASC module
is a part of Cyfronet’s SGI Altix 4700 SMP system. The modern architecture of RASC is also
presented; in principle, it consists of FPGA chips and very fast, 128-bit wide local
memory. Design tools available to designers are presented as well.
Keywords: custom computing, single-purpose processors, FPGA, high performance computing,
SGI RASC
ACCELERATION OF COMPUTATIONS ON THE SGI RASC PLATFORM:
A COMPUTING MODULE BASED ON RECONFIGURABLE LOGIC
The authors present a new method of performing large-scale computations, based on FPGA
devices. In particular cases its use shortens the computation time. The method is founded
on performing computations with computing architectures designed for a given algorithm.
Because the architecture is created specifically for the given algorithm, it makes better
use of the possibilities of parallel and pipelined execution. Reconfigurable devices are
used as the platform for implementing the dedicated architectures. The article also
presents results of applying the technique in areas such as cryptography, data analysis
and double-precision scientific computation. Other fields of science in which the described
technique is successfully applied are also indicated (e.g. bioinformatics). The implemented
algorithms were run and tested on the SGI RASC module installed at ACC Cyfronet AGH, which
is part of the Altix 4700 SMP system. The architecture of the RASC module and the tools and
design methods available to programmers are presented.
Keywords: hardware acceleration of computations, dedicated processors, FPGA, large-scale
computing, SGI RASC
∗ ACC „Cyfronet” AGH, Dept. of Electronics AGH, University of Science and Technology,
Krakow, Poland
∗∗ ACC „Cyfronet” AGH, University of Science and Technology, Krakow, Poland
Computer Science • Vol. 9 • 2008
1. Introduction
The importance of numerical calculation in science and engineering today is unquestionable.
For example, numerical modelling has replaced the major part of practical experiments in
many disciplines, owing to the cost and efficiency of this approach. It is better and
cheaper to precede a natural experiment with a simulation in order to anticipate the
features of natural objects by numerical analysis. This is true for chemistry, biology,
physics, aerodynamics, construction, etc. Sometimes there is also a need for
computer-aided processing of experimental data: when an experiment yields an enormous
amount of information, the use of a computer is essential.
There are two main approaches to achieving the necessary computing performance and
throughput. One is to build multi-processor installations, which today are able to offer
almost 10^15 floating point operations per second. An example of such a system is the
BlueGene/L installation at Lawrence Livermore National Laboratory in California, which
offers 596 TFlops (1 TFlops = 10^12 floating point operations per second) of peak
computing power (November 2007). The other is to combine several computing sites into a
computing grid. Grid organizations like EGEE (Enabling Grids for E-SciencE) bring together
the computing capabilities of many computers using network infrastructure. The cost of
building high computing power infrastructure is usually enormous. The maintenance of such
systems is also expensive, even if we consider only electric power costs; regarding power,
we must take both supply and air conditioning costs into account.
Up to now the constant demand for higher computing capabilities has been satisfied by
progress in semiconductor technology. According to Moore’s Law, formulated in the 1960s,
the number of transistors on a single chip doubles every 18 months. This has held since
the first microprocessor appeared on the market, but it seems that Moore’s Law is going to
break down in the near future. With 45 nm semiconductor technology, the size of a
transistor is getting closer and closer to the size of an atom (ca. 0.1 nm). Nobody has
proposed a sub-atomic switching device so far, so we will probably hit the wall in the near
future and transistor scaling will no longer be possible. On the other hand, the
architectures of today’s processors are still being improved, and many of these
improvements help software algorithms perform better. Today’s processor architecture is
much more sophisticated than the architecture of the first microprocessors, but it still
seems not to be optimal for the executed tasks. Of course, computer and processor designers
know the typical algorithm bottlenecks and try to avoid them, but the truth is that
processors are optimized for algorithms treated as a set of problems. From the perspective
of a single algorithm and its properties, today’s processor architecture is far from
optimal, because the optimal architecture differs from one algorithm to another. A General
Purpose Processor (GPP) is a single universal hardware machine designed to execute
implemented instructions in order to complete any algorithm. This is the basic paradigm of
GPP-based computation.
2. Field Programmable Gate Arrays
Gates are the basic building blocks of any digital device. They are themselves elementary
digital devices, but they are so simple that they can only perform fundamental logical
operations such as NOT, AND and OR. Gates are made of transistors. Any computing device,
such as an adder, a multiplier or a register, must be constructed from gates. Semiconductor
devices like microprocessors are usually called full-custom devices. This means that the
entire process of their design and fabrication starts from a blank semiconductor wafer. It
is a very expensive way of designing a device, as many unique preparation steps must be
performed before chip fabrication can start, and each extra step adds cost. These
fabrication preparation costs do not matter if the volume of produced chips is high,
because they are shared among all devices sold. If only a moderate final quantity of the
designed chip is anticipated, it is better to use another approach, the so-called
semi-custom process. In a semi-custom process some of the steps are common to a set of
different silicon devices, so the total quantity of final devices that use the same process
increases. In this approach the number of fabrication preparation steps is reduced, and so
are the costs. There are many semi-custom technologies. One of them is called Mask
Programmable Gate Arrays (MPGA). In MPGA technology silicon wafers come with gates already
implemented but not connected. This intermediate product is used for the final
implementation: the only remaining steps are the preparation of connections between gates.
To create the connections an appropriate mask must be designed, which gives the technology
its name.
In the late 1980s a new and interesting semiconductor technology appeared. It was an array
of logic gates, as in MPGA technology, but, in contrast, the connections were also already
implemented. The technology was useful for different designs because the implemented
connections were flexible (they could be freely configured). That was the birth of Field
Programmable Gate Array (FPGA) technology. The implemented connections consisted of both
wires and programmable switches that could be set electrically. Unlike in MPGA, where the
configuration of connections must be performed in a semiconductor factory, the analogous
process in an FPGA can be performed at the customer's or user's location by downloading a
proper configuration file. During configuration the FPGA behaves like a memory, because
the data is stored inside its structure. Each bit of downloaded data sets the appropriate
switch 'on' or 'off'. As a result, the design process is not only cheaper than in any other
semiconductor technology but also much faster and easier. What is more, a single FPGA chip
can be configured many times: whenever necessary, the configuration can be replaced by a
different one. Thanks to this feature a single FPGA chip can act as a set of different
chips (not simultaneously, of course).
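The idea that downloaded configuration bits alone determine the behaviour of the hardware can be illustrated with a toy model. The sketch below assumes a single two-input logic cell built as a four-entry lookup table; the cell model and the names `make_lut`, `AND_CONFIG` and `XOR_CONFIG` are illustrative assumptions, not a description of any particular FPGA device.

```python
def make_lut(config_bits):
    """Return a 2-input logic function defined purely by 4 configuration bits.

    Toy model (an assumption for illustration): the "hardware" is a fixed
    4-entry table and only its stored bits change between configurations.
    """
    if len(config_bits) != 4 or any(b not in (0, 1) for b in config_bits):
        raise ValueError("expected four bits, each 0 or 1")
    table = tuple(config_bits)

    def logic(a, b):
        # The two inputs select one of the four stored bits.
        return table[(a << 1) | b]

    return logic


# Two hypothetical "configuration files" for the same cell.
AND_CONFIG = (0, 0, 0, 1)
XOR_CONFIG = (0, 1, 1, 0)

inputs = [(a, b) for a in (0, 1) for b in (0, 1)]

cell = make_lut(AND_CONFIG)
and_outputs = [cell(a, b) for a, b in inputs]

cell = make_lut(XOR_CONFIG)  # "reconfigure" the same cell
xor_outputs = [cell(a, b) for a, b in inputs]

print(and_outputs, xor_outputs)
```

The same function object structure serves as an AND gate or an XOR gate depending only on the stored bits, which mirrors the statement that one chip can act as different chips at different times.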
3. Custom computing
Hardware acceleration of algorithms is a method for faster algorithm execution. The
speedup is achieved by a proper adaptation of the host platform architecture. If the
architecture of the computer is fixed, the only way to accelerate algorithm execution is to
write and compile the software with the structure of the target system in mind. We call
this software acceleration of an algorithm. For example, if the cache size is known, the
programmer should avoid operating on data structures bigger than the cache and, where
possible, split such structures into smaller pieces and process them separately. When the
architecture of the processor or computer is modified for faster execution of selected
instructions, we call the technique hardware acceleration. For a GPP, hardware acceleration
means optimizing the processor or computer architecture to allow efficient execution of
selected operations which are common bottlenecks in the majority of algorithms. Multiply
and accumulate (MAC) is an example of such an operation. It is broadly used in linear
algebra, digital signal analysis, image processing, etc. In contemporary GPPs, thanks to
hardware tailoring, the MAC operation is performed at a speed close to the processor’s peak
computing power. This in turn encourages programmers to use MAC operations in their code as
frequently as possible. In this way hardware and software fit each other. The MAC operation
is an example of how a dedicated processor architecture can support the execution of a
selected operation.
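The cache-oriented software acceleration mentioned above can be sketched as follows. The example assumes a computation that makes two passes over the data; the function names and the block length are hypothetical, and plain Python will not show the timing benefit, so the code only illustrates how the loop structure changes.

```python
def two_pass_unblocked(data):
    """Run each pass over the whole array; large arrays fall out of cache between passes."""
    scaled = [2 * x for x in data]       # pass 1 over all elements
    return [x + 1 for x in scaled]       # pass 2 over all elements


def two_pass_blocked(data, block_len):
    """Run both passes on one cache-sized block before moving to the next block."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    out = []
    for start in range(0, len(data), block_len):
        block = data[start:start + block_len]
        block = [2 * x for x in block]   # pass 1 on the block
        block = [x + 1 for x in block]   # pass 2 while the block is still "hot"
        out.extend(block)
    return out


sample = list(range(10))
blocked_result = two_pass_blocked(sample, block_len=4)  # block length is a placeholder
print(blocked_result)
```

Both variants compute the same result; only the order in which memory is touched differs.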
Owing to the great variety of algorithms, a GPP can draw only limited benefit from
hardware acceleration techniques. When the family of problems to be solved is narrowed
down, the potential of the presented techniques grows. Usually we consider algorithms
related to one another in order to find as many common features as possible. Then we build
custom hardware that fits those characteristics.
Hardware acceleration is beneficial both for increasing computing power and for saving
energy, so what stops us from using it universally? The shortest answer to this question
is: “Costs”. As mentioned before, the development of a digital system requires several
steps: behavioural modelling, validation of the model, structural design, project
simulation, prototype construction and testing. Some or all of the steps are repeated until
the system meets all the requirements. If the technology platform of the digital system is
an integrated circuit, those stages are quite expensive. For example, layer masks are
necessary to obtain a real prototype of a semiconductor chip, and in a 65 nm process a
single set of masks costs 1 million dollars. All costs borne to produce the very first unit
of the system are called Non-Recurring Engineering (NRE) costs. The NRE cost of each digital
system must be recovered as a part of the price of every item sold. That is why companies
can only consider offering devices that are expected to sell in very high quantities. This
condition is of course fulfilled in the case of GPPs and GPUs. An algorithm must have a
special status to aspire to hardware acceleration.
From the above argument it is clear that for custom, specific algorithms the only way to
use hardware acceleration is to use a low-NRE technology, that is, FPGA technology.
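The cost argument can be made concrete with a small amortisation calculation. Only the 1 million dollar mask-set cost comes from the text; the per-unit prices and the zero NRE for the FPGA option are placeholder assumptions chosen purely for illustration.

```python
def cost_per_unit(nre, unit_price, volume):
    """Total cost of one device when the NRE cost is spread over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre / volume + unit_price


def break_even_volume(nre_high, unit_low, nre_low, unit_high):
    """Volume above which the high-NRE, low-unit-price option becomes cheaper."""
    if unit_high <= unit_low:
        raise ValueError("the low-NRE option must have the higher unit price")
    return (nre_high - nre_low) / (unit_high - unit_low)


MASK_SET_COST = 1_000_000   # from the text (65 nm process)
ASIC_UNIT_PRICE = 10        # hypothetical
FPGA_NRE = 0                # hypothetical: no masks needed
FPGA_UNIT_PRICE = 500       # hypothetical

volume = break_even_volume(MASK_SET_COST, ASIC_UNIT_PRICE, FPGA_NRE, FPGA_UNIT_PRICE)
print(f"custom chip pays off above ~{volume:.0f} units")
print(cost_per_unit(MASK_SET_COST, ASIC_UNIT_PRICE, 1000))  # cost at 1000 units
```

Under these assumed prices a design needed in only a few hundred copies is far cheaper on an FPGA, which is the situation of a special-purpose algorithm.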
4. Reconﬁgurable computing
The most expensive design is for full-custom technology. Semi-custom technology is less
expensive, but the cheapest is FPGA technology. Unfortunately, cheaper also means slower
and less capacious, which must be kept in mind when HPC algorithms are moved to an FPGA.
Approximately, FPGAs are 10 times slower than full-custom processors and offer 3 times
fewer resources. We can conclude that full-custom devices have a 30-fold functional
advantage over FPGAs. Despite that, as we shall see, in some cases an FPGA can perform
better even under such circumstances. The best architecture for profiting from hardware
acceleration is a system equipped with both a GPP and an FPGA. Such co-processor based
solutions are well known, but the novelty of this particular proposal lies in the ability
of the co-processor to reconfigure its structure and to work as a freely defined hardware
processor architecture. It acts according to the configuration file downloaded at a given
moment (Fig. 1).
Fig. 1. Reconﬁgurable logic can replace several hardware accelerators
The capability for hardware acceleration comes from the utilization of fine-grain
parallelism. FPGAs can be used to build custom dataflow processors on a per-algorithm
basis, which do away with the instruction fetch and decode overhead and the serial nature
of the von Neumann architecture. With this extra efficiency it is possible to increase
computational throughput.
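How fine-grain parallelism can outweigh a slower clock can be shown with a back-of-the-envelope model. The factor of 10 in clock speed is taken from the text; the GPP clock and its operations per cycle are hypothetical values, and the model ignores memory effects entirely.

```python
def ops_per_second(clock_hz, ops_per_cycle):
    """Peak throughput of a device that completes `ops_per_cycle` operations each cycle."""
    return clock_hz * ops_per_cycle


def break_even_parallelism(clock_ratio, gpp_ops_per_cycle):
    """Parallel operations per cycle the slower device needs to match the GPP."""
    return clock_ratio * gpp_ops_per_cycle


CLOCK_RATIO = 10                 # "10 times slower", from the text
GPP_CLOCK_HZ = 2_000_000_000     # hypothetical
GPP_OPS_PER_CYCLE = 2            # hypothetical
FPGA_CLOCK_HZ = GPP_CLOCK_HZ // CLOCK_RATIO

needed = break_even_parallelism(CLOCK_RATIO, GPP_OPS_PER_CYCLE)
print(f"FPGA needs more than {needed} parallel operations per cycle to win")
```

Any dataflow design that keeps more than this number of operators busy in every cycle exceeds the GPP's peak rate in this simplified model.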
The lack of a high-level programming ecosystem is a problem for reconfigurable computing.
GPPs have rich support from programming tools that have evolved over decades through the
effort of millions of individuals. FPGAs have adopted semiconductor technology from GPPs,
and much compiler technology can be adopted as well, but methods for full automation of
the design process are still lacking.
5. Computing platforms
The most important issue when a hardware acceleration platform is considered is fast and
efficient data transfer between the system's operational memory and the hardware
accelerator or (depending on the architecture) the hardware accelerator's cache memory.
In the first PC-based reconfigurable computing systems, FPGA accelerators were attached to
the computer through a peripheral bus such as PCI or PCI-X. Such a solution has
limitations that hold back good acceleration results. In contrast to the reconfigurable
accelerator, the processor has at its disposal a very fast local system bus with immediate
access to system memory. Hardware acceleration is less attractive in this case because the
delay of transferring data from and to system memory must be added to the accelerator's
processing time. Even if the GPP computes more slowly than the FPGA, it can turn out to be
the better solution because of its quick data transfer. In demanding computing
applications the FPGA must be more closely integrated with the rest of the computing
system.
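The effect of transfer delay on the achievable speedup can be sketched with a simple timing model. The compute times and data volume are hypothetical; the 133 MB/s figure is the theoretical peak of a 32-bit, 33 MHz PCI bus, and 6.4 GB/s is the NUMALink rate quoted later in this paper.

```python
def effective_speedup(gpp_time_s, fpga_compute_s, bytes_moved, bus_bytes_per_s):
    """Speedup of the accelerator once the data transfer time is added to its compute time."""
    transfer_s = bytes_moved / bus_bytes_per_s
    return gpp_time_s / (fpga_compute_s + transfer_s)


GPP_TIME_S = 1.0        # hypothetical software run time
FPGA_COMPUTE_S = 0.1    # hypothetical: raw accelerator is 10x faster
BYTES_MOVED = 1e9       # hypothetical: input plus output data

over_pci = effective_speedup(GPP_TIME_S, FPGA_COMPUTE_S, BYTES_MOVED, 133e6)
over_numalink = effective_speedup(GPP_TIME_S, FPGA_COMPUTE_S, BYTES_MOVED, 6.4e9)
print(f"PCI: {over_pci:.2f}x, NUMALink: {over_numalink:.2f}x")
```

With these assumed numbers the accelerator behind the peripheral bus is slower than the GPP overall, while the same accelerator on the fast interconnect retains a clear advantage.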
There is another approach, implemented in HPC systems, in which the FPGA is treated as an
integral system component rather than a peripheral. It is linked directly to the
processor’s resources through high-speed connections and so overcomes the biggest
bottleneck of FPGA co-processing. To provide maximum performance, the FPGA co-processor has
the same access to operational memory as the GPP. For example, in the Cray XD1
reconfigurable system the FPGA is integrated with the processor system bus. SGI Corp. has
proposed a corresponding solution in its Altix family of systems. Altix is a family of
SGI’s SMP (Symmetric Multi-Processing) solutions. Its distinguishing feature is that each
processor has both fast access to its local memory and slower access to the local memories
of the other processors in the system. Data exchange between processors is carried out
over the NUMALink bus and is therefore relatively fast. The NUMALink interconnect is a
hierarchical system bus that allows global addressing and scalability of the SMP system.
The maximum NUMALink data transfer rate is 6.4 GB/s. A Reconfigurable Application Specific
Computing (RASC) module can be an integral component of an Altix system. RASC is SGI’s
technology that enables users to develop application-specific hardware using
reconfigurable logic elements. The SGI RASC is tightly integrated with NUMALink. From the
hardware perspective, the FPGA no longer works in co-processor mode in this model. With
NUMALink the FPGA has access to global shared memory and there is no need to load and
unload data. The RASC is equipped with two Virtex4LX200 FPGA chips [15], each offering 200k
reconfigurable logic cells. Additionally, there are two blocks of 64 MB QDR RAM memory,
which act as a second-level cache for the FPGA. The first-level cache is implemented inside
the Virtex4LX200 structure and is called BlockRAM. The bidirectional data interface
implemented for the FPGA is 128 bits wide and is clocked at 200 MHz.
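The raw bandwidth implied by the interface parameters above follows from a one-line calculation, sketched here. It assumes one 128-bit word is transferred per clock cycle in each direction; the function name is illustrative.

```python
def interface_bytes_per_second(width_bits, clock_hz):
    """Peak one-way bandwidth, assuming one full-width transfer per clock cycle."""
    return (width_bits // 8) * clock_hz


WIDTH_BITS = 128          # from the text
CLOCK_HZ = 200_000_000    # from the text

one_way = interface_bytes_per_second(WIDTH_BITS, CLOCK_HZ)
print(f"{one_way / 1e9:.1f} GB/s per direction")
```

Under this assumption each direction carries 3.2 GB/s, so reading and writing together match the 6.4 GB/s NUMALink figure.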
All the implementations presented below were executed on the SGI RASC installed at the
Academic Computing Centre „Cyfronet” AGH, University of Science and Technology.
6. FPGA High Performance Computing
At present, the potential of FPGAs as reconfigurable computing engines in HPC has been
recognised. Recent advances in speed and density have brought FPGAs into supercomputing
solutions. Progress in programming environments for reconfigurable computing is also
important in this process. Thanks to high-level languages (HLLs) it is no longer necessary
to be a hardware engineer to benefit from reconfigurable computing.
There are also many tasks in a typical computing job that it does not make sense to
accelerate in FPGA fabric. Conventional processors are highly efficient at many common
computational problems. On the other hand, there are many algorithms that do not fit well
into the fixed structure of a GPP. For applications that are highly parallel at the
fine-grain level and spend much of their computational time in integer and fixed-point
calculations, an FPGA can be an attractive alternative. Algorithms friendly to
reconfigurable computing also rely primarily on local data. Such applications stand to
gain 10 times or more in overall application performance with FPGA acceleration.
Biomedical applications (DNA sequence alignment, for example) involve extremely
compute-intensive algorithms that are well suited to hardware acceleration. For example,
the Smith-Waterman algorithm runs at 50× the speed of a GPP on an average-size FPGA
[6, 16]. The Smith-Waterman algorithm compares DNA and amino acid sequences against known
genes and proteins to identify the best candidate. This is an extreme example, but here
are some others:
• In seismic imaging, searching for patterns in sensor data can gain a 17× acceleration
[7].
• Vehicular traffic simulation is 300× faster on an FPGA such as the Xilinx VirtexII
(XC2V6000) than on a 1.7 GHz Xeon.
Other areas are computational chemistry, encryption, automation, security, geology,
finance applications, etc. Seismic processing and simulation involve a significant amount
of single-precision floating point operations, which are typically considered a no-go for
FPGAs.
There is also a class of green computing solutions. FPGAs demonstrate improved computing
performance per watt, per dollar and per cubic meter over traditional processors. Whereas
a high-end microprocessor draws over 100 W, an FPGA typically consumes around 15 W when
executing high-performance algorithms [11]. We are thus able to reduce electricity costs,
air conditioning costs and machine room floor space through reduced thermal density.
7. Example hardware implementations
For evaluation purposes, an SGI RASC platform was purchased and installed as a part of the
Altix 4700 system. Several algorithms and functions were implemented. Brief descriptions
and the implementation results achieved are given here. They are intended as a reference
for estimating the speedup to be expected from hardware implementations of other candidate
HPC functions.
7.1. Pattern matching hardware acceleration
A Bloom filter was realised in hardware for the purpose of pattern matching [1]. The Bloom
filter is a method suitable for matching a large set of binary or character patterns in
input data. The input data is sequentially hashed, and the computed hash is then compared
with the stored hashes of the search patterns. In the implemented hardware the hashing and
matching processes are executed in parallel.
Hashing is a process of compressing 'w'-bit information into 'h'-bit information, where
w > h. Usually it is realised by a process described by polynomial division. The goal is
to achieve an even distribution of the input data over the output data set. Traditionally,
hashing requires processing one byte (character) in each clock cycle. In our solution, the
hash-compare process is replicated 16× to match the maximum data throughput and hardware
interface width of the SGI RASC. The clock frequency of the solution is 100 MHz and
16 bytes of data are transferred in each clock cycle. There are 16 parallel Bloom filter
structures to process 16 bytes at once. As a result, the hardware matcher can search
through 1.6 GBytes of data per second. In practice this means that the performance of our
matcher is limited by I/O performance, for example when the data is stored on a hard disk.
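Hashing by polynomial division can be sketched in software as a bit-serial shift register with feedback, in the style of a CRC. The generator polynomial and hash width used below are illustrative assumptions (the paper does not state which polynomial the hardware uses), and the function name `poly_hash` is hypothetical. The last lines reproduce the throughput arithmetic from the text.

```python
def poly_hash(data: bytes, poly: int, h: int) -> int:
    """Compress `data` to an h-bit value by division over GF(2).

    `poly` holds the low h coefficients of the generator polynomial
    (the leading x^h term is implicit), as in a CRC register.
    """
    mask = (1 << h) - 1
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):                 # most significant bit first
            bit = (byte >> i) & 1
            feedback = ((reg >> (h - 1)) & 1) ^ bit
            reg = (reg << 1) & mask
            if feedback:
                reg ^= poly
    return reg


# Illustrative parameters: 8-bit hash with generator x^8 + x^2 + x + 1.
digest = poly_hash(b"\x01", poly=0x07, h=8)
print(hex(digest))

# Throughput of 16 parallel lanes, one byte per lane per cycle, at 100 MHz.
LANES = 16
CLOCK_HZ = 100_000_000
bytes_per_second = LANES * CLOCK_HZ
print(bytes_per_second / 1e9, "GB/s")
```

In hardware the inner loop corresponds to XOR logic that can be unrolled so that a whole byte, or several bytes, are absorbed in one clock cycle.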
An important limitation of the Bloom filter algorithm is that the matching process is not
deterministic (false positives are possible): every match reported by the algorithm should
additionally be confirmed by direct matching. The Bloom filter works as a selector that
limits the number of candidates for direct matching algorithms. The bigger 'h' is, the
more reliable the Bloom algorithm becomes. In practice, in hardware, a big 'h' value means
that a large memory is needed to store the data. The other parameter that affects the
reliability of the filter is the number of searched patterns. If the memory is small and
many positions are marked in it, the number of false hits increases. In our
implementation, to limit resource utilization when a large number of patterns is searched,
we implemented up to six Bloom filters inside the FPGA structure. The number of filters
depends on the maximum pattern length. Several implementations are available, which differ
in the pattern lengths and the number of filters implemented. Example implementation
results are presented in Table 1.
Table 1
Implementation results for a set of six Bloom filters.
Pattern lengths: 128, 192, 192, 256, 256 bits;
128-bit data interface; 42-bit hash length

Used FPGA resources           78%
Used FPGA memory resources    92%
Data throughput               1.6 GByte/s
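The filter-then-confirm scheme described in this section can be sketched in software as follows. This is a minimal illustration, not the hardware design: it assumes equal-length patterns, uses SHA-256 from the standard library in place of the polynomial hash, and the names `BloomFilter` and `find_patterns` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array of 2**h_bits positions; an item marks `num_hashes` of them."""

    def __init__(self, h_bits, num_hashes=1):
        self.size = 1 << h_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: bytes):
        for k in range(self.num_hashes):
            digest = hashlib.sha256(bytes([k]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item: bytes) -> bool:
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def find_patterns(data: bytes, patterns, h_bits=16):
    """Slide a window over `data`; confirm every Bloom hit by direct matching."""
    length = len(patterns[0])            # assumption: all patterns share one length
    bloom = BloomFilter(h_bits)
    exact = set(patterns)
    for p in patterns:
        bloom.add(p)
    hits = []
    for i in range(len(data) - length + 1):
        window = data[i:i + length]
        if bloom.might_contain(window) and window in exact:
            hits.append((i, window))
    return hits


matches = find_patterns(b"xxabcdyyefgh", [b"abcd", b"efgh"])
print(matches)
```

The direct comparison after `might_contain` plays the role of the confirmation step: the filter only reduces the number of windows that need it.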
7.2. Implementation of AES coder
Contemporary computing and telecommunication systems require data security, and one of
the security measures is cryptography. The simplest way to encode data is to use coder and
decoder software executed on a microprocessor. In many applications such a solution is not
acceptable because of time constraints. The obvious solution is to implement the cipher
algorithm as a hardware accelerator.
In our case, the Advanced Encryption Standard (AES) [12] was implemented in an FPGA. AES
is the successor of the obsolete Data Encryption Standard (DES). AES is based on the
Rijndael algorithm, which won the competition, launched in 1997, for a new and safer
coding algorithm. The Rijndael algorithm encodes blocks of data of 128, 192 or 256 bits.
We implemented the basic version of the algorithm, in which the data block and the cipher
key are both 128 bits long; this is known as the AES-128 algorithm. Rijndael is a
relatively fast and very safe algorithm. Up to now, despite many attempts, nobody has been
able to crack AES-coded data. Before a 128-bit block of data is coded, it is organized as
a 4×4 table of 8-bit elements. The same data representation is adopted for the key. These
tables are the sole data for the complete encryption process. The main part of the coding
procedure consists of 9 rounds executed sequentially, preceded by an initial key addition
and followed by a final, slightly shorter round. The 9 rounds are identical and consist of
basic operations such as XOR, data shift, finite-field arithmetic and LUT (Look-Up Table)
substitution. The only difference between rounds is that each of them uses a different
round key. The round keys (eleven in total for AES-128) are generated from the single
cipher key in the key expansion procedure. Decoding of data is very similar to the coding
process and runs in the opposite direction. A detailed description of the Rijndael
algorithm can be found in [4].
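The 4×4 table representation and the XOR with a round key can be illustrated with a few lines of code. The sketch follows the column-major byte ordering defined in the AES standard and shows only the key-addition step; it is not an AES implementation, and the function names are illustrative.

```python
def to_state(block: bytes):
    """Arrange a 16-byte block as a 4x4 table: state[row][col] = block[row + 4*col]."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte (128-bit) blocks")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def from_state(state) -> bytes:
    """Inverse of to_state: read the table column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


def add_round_key(state, round_key_state):
    """XOR every table element with the matching element of the round key table."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key_state)]


block = bytes(range(16))                    # example data
key = bytes([0xA5] * 16)                    # hypothetical round key
mixed = add_round_key(to_state(block), to_state(key))
restored = add_round_key(mixed, to_state(key))   # XOR is its own inverse
print(from_state(restored) == block)
```

All sixteen XOR operations are independent, which is why this step maps naturally onto parallel hardware.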
In our implementation, the coding and decoding processes are organized as a pipeline. In
pipelined computation new data is applied to the processor while earlier data is still
being processed in subsequent computational stages. Thanks to pipelining, the AES coder
can accept new data in each clock cycle. Besides the acceleration due to pipelined
execution, there is also parallel computation in our architecture: at each processing
stage similar operations are executed in parallel for each element of the data table. When
a stage of the algorithm is completed, the temporary results are latched in registers. In
the next clock period that data is read by the next processing stage. Because of
pipelining and the several execution stages, coded data appears at the coder output after
11 clock cycles, i.e. the latency of the designed module is 11 clock cycles. Timing
results and logic resources of the implemented AES processor are presented in Table 2.
Table 2
Implementation results of AES algorithm

AES implementation      Coder        Decoder
FPGA logic resources    40%          40%
Data throughput         21 Gbit/s    16 Gbit/s
For reference, the authors found an assembler implementation of AES-128 for a 3.2 GHz
Pentium 4. The data throughput of that solution was 1.5 Gbit/s. We thus achieve a speedup
of roughly 14× for coding and 10× for decoding, i.e. the hardware needs only about 0.067
and 0.1 of the software execution time, respectively.
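The benefit of pipelining and the throughput ratios can be checked with a short calculation. The latency of 11 cycles and the throughput figures come from the text and Table 2; the cycle-count model (one new block accepted per cycle) is a simplifying assumption, and the function names are illustrative.

```python
LATENCY_CYCLES = 11   # from the text


def cycles_pipelined(num_blocks, latency=LATENCY_CYCLES):
    """First result after `latency` cycles, then one result per cycle."""
    return 0 if num_blocks == 0 else latency + num_blocks - 1


def cycles_unpipelined(num_blocks, latency=LATENCY_CYCLES):
    """Each block must finish before the next one starts."""
    return num_blocks * latency


def throughput_ratio(hardware_gbit_s, software_gbit_s):
    return hardware_gbit_s / software_gbit_s


SOFTWARE_GBIT_S = 1.5                      # Pentium 4 reference from the text
coder_ratio = throughput_ratio(21, SOFTWARE_GBIT_S)
decoder_ratio = throughput_ratio(16, SOFTWARE_GBIT_S)

print(cycles_pipelined(1000), cycles_unpipelined(1000))
print(coder_ratio, round(decoder_ratio, 1))
```

For long data streams the pipelined design approaches one block per clock cycle, so the 11-cycle latency has almost no effect on throughput.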
7.3. Montgomery multiplier
RSA is the most common cryptographic algorithm for establishing a secure connection
between two parties [13]. The key operation of the RSA algorithm, involved in both the
decryption and the encryption process, is modular exponentiation. The main operations in
computing a modular exponentiation are repeated modular multiplications and squarings. In
our research we implemented Montgomery modular multiplication [10], the core operation in
computing modular exponentiation.
The basic Montgomery algorithm consists of a series of additions. A single addition is
hard to parallelise without additional logic because of carry propagation. Since the
arguments are thousands of bits long, special techniques must be used to make parallel
execution possible. In our case the problem was addressed by dividing the long carry
chains of the addition into parts and implementing a mechanism that allows all of them to
be processed simultaneously. We implemented our module in a Virtex4 LX200 FPGA device.
Results for a 2048-bit Montgomery multiplier with different adder widths are shown in
Table 3.
Table 3
Implementation results for 2048 bit Montgomery multiplier

Adder length [bits]    FPGA logic    Clock speed [MHz]
2                      16%           200
3                      15%           164
8                      14%           165
16                     14%           169
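The statement that the basic Montgomery algorithm consists of a series of additions can be seen in the textbook bit-serial (radix-2) form sketched below. This is the generic algorithm, not the authors' carry-chain-splitting architecture; the function names and the small example modulus are illustrative.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2**(-k) mod n for odd n < 2**k and 0 <= a, b < n.

    Each iteration uses only additions and a one-bit right shift.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s += b           # add b when bit i of a is set
        if s & 1:
            s += n           # adding n makes s even without changing s mod n
        s >>= 1              # exact division by 2
    return s - n if s >= n else s


def modular_multiply(a, b, n, k):
    """Ordinary a * b mod n computed through the Montgomery domain."""
    r2 = pow(2, 2 * k, n)                       # R^2 mod n, with R = 2**k
    a_m = montgomery_multiply(a, r2, n, k)      # a * R mod n
    b_m = montgomery_multiply(b, r2, n, k)      # b * R mod n
    prod_m = montgomery_multiply(a_m, b_m, n, k)
    return montgomery_multiply(prod_m, 1, n, k)  # leave the Montgomery domain


N, K = 97, 7          # small odd modulus with N < 2**K, for illustration only
product = modular_multiply(45, 76, N, K)
print(product)
```

In hardware each loop iteration becomes one or two wide additions, which is why the carry chain of the adder determines the achievable clock speed.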
From the speed of our module, the theoretical speed of exponentiation can be calculated.
Using the right-to-left binary algorithm and repeating the modular multiplication the
appropriate number of times, the modular exponent can be computed and hardware and
software implementations can be compared (Table 4). Our fastest architecture (with a 2-bit
adder length) shows a speedup only over the 32-bit GPP (General Purpose Processor).
Table 4
Comparison of modular exponentiation execution time on three platforms

                       Execution time
Exponent size [bit]    Athlon XP 2600+ (32 bit)    FPGA      Athlon 64 3500+ (64 bit)
1024                   9.4 ms                      5.5 ms    2.7 ms
2048                   60 ms                       27 ms     17 ms
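The right-to-left binary algorithm mentioned above can be sketched as follows. For brevity the sketch uses Python's built-in modular multiplication where the hardware would use the Montgomery multiplier, and it also counts the multiplications; the function name is illustrative.

```python
def modexp_right_to_left(base, exponent, modulus):
    """Compute base**exponent mod modulus, scanning exponent bits from the LSB.

    Returns the result and the number of modular multiplications performed.
    """
    if exponent < 0 or modulus <= 0:
        raise ValueError("exponent must be >= 0 and modulus > 0")
    result = 1 % modulus
    square = base % modulus
    multiplications = 0
    while exponent:
        if exponent & 1:
            result = (result * square) % modulus   # multiply step
            multiplications += 1
        square = (square * square) % modulus       # squaring step
        multiplications += 1
        exponent >>= 1
    return result, multiplications


value, count = modexp_right_to_left(7, 0b101101, 1009)
print(value, count)
```

For a k-bit exponent the loop performs k squarings and one extra multiplication per set bit, so the execution time scales with the exponent size multiplied by the time of a single modular multiplication.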
Further speedup, however, can be achieved by using higher radices in the computations.
This means that instead of executing the binary algorithm and analysing one bit per cycle,
several bits are analysed at a time. Such an optimization makes it possible to consider
applying FPGAs to hardware acceleration of RSA cryptography.
7.4. Implementation of exponential function
Many hardware implementations of the single-precision floating point (FP) exponential
function can be found, in contrast to efficient implementations for the double-precision
standard [3], which are unknown to the authors. This disproportion results from the fact
that commonly known table-based or polynomial methods are not straightforwardly applicable
to this elementary function in double precision. Therefore some novel solutions were
adopted in the proposed exp() calculation module, not only to preserve compatibility with
the double-precision standard but also to achieve a high processing speed (200 MHz) and
satisfactory accuracy. The exp() module is fully pipelined (the maximum pipeline latency
is 30 clock cycles) [8]. The exp() function is accelerated on the SGI RASC [5] board with
two Virtex-4 LX200 FPGAs. The exp() function alone occupies less than 3% of a Virtex-4
LX200 FPGA. Exp() arguments are fetched to the FPGAs and results are sent back to the
processors over the NUMAlink system bus working at 6.4 GB/s. The exponential module
reaches a processing speed of 200 MHz, while the external memory interface limits the
number of operations to two exp() per clock cycle per FPGA. The overall end-to-end
execution time of the algorithm is 0.1 of that of a sequential implementation executed on
a single 1.5 GHz Intel Itanium 2 microprocessor, i.e. a tenfold speedup. If no separate
argument fetch is necessary for the different modules then, owing to the low resource
consumption of a single exp() unit, up to 15 modules can fit in the Virtex-4 LX200, which
results in a huge speedup over the GPP implementation. For example, quantum chemistry
calculations involve several exp() operations, which is regarded as the main advantage of
the FPGA [9]. It is worth mentioning that within the module the calculations are conducted
in fixed-point format; only the front and end interfaces of the module are IEEE-754
compatible. This is the reason for the successful implementation and for the total
execution-time ratio of 0.0167 (a speedup of about 60×) with all 15 modules working
together.
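The internal method of the exp() module is not detailed here, so the following is only a generic illustration of how an exponential is commonly split into an exact power of two and a small-argument polynomial. It is a textbook range-reduction sketch in floating point, not the authors' fixed-point design; the function name and the number of series terms are assumptions.

```python
import math

LN2 = math.log(2.0)


def exp_range_reduced(x, terms=14):
    """Approximate exp(x) as 2**k * exp(r) with r = x - k*ln(2), |r| <= ln(2)/2."""
    k = round(x / LN2)            # integer power of two
    r = x - k * LN2               # small remainder
    # Horner evaluation of the truncated Taylor series of exp(r).
    acc = 1.0
    for n in range(terms, 0, -1):
        acc = 1.0 + acc * r / n
    return math.ldexp(acc, k)     # scale by 2**k (an exponent adjustment only)


for x in (-20.0, -1.5, 0.0, 0.3, 10.0):
    print(x, exp_range_reduced(x), math.exp(x))
```

The multiplication by 2**k only changes the exponent field of the result, so the arithmetic effort is concentrated in the polynomial for the small remainder, which is the part suited to fixed-point datapaths.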
7.5. Implementation of GEMM function
There is a set of mathematical functions widely used in scientific computations, and
linear algebra operations are the foundation of many of them. To provide both software
portability between platforms and efficient implementations of the most commonly
used functions, standard linear algebra libraries such as LAPACK and LINPACK were
defined. The operations in these libraries seem too complex for hardware
implementation, and there are too many functions to implement them all in hardware.
Fortunately, these are higher-level libraries that perform their operations by calling
the Basic Linear Algebra Subprograms (BLAS), the lowest-level linear algebra library.
BLAS defines three types of functions: vector-vector operations, matrix-vector
operations and matrix-matrix operations. The operations are matrix multiplications
combined with transpositions and inversions. From the hardware implementation
perspective, the matrix-matrix operations [2] are the most promising because they
have the best ratio of computation to data transfer. In a simple matrix multiplication
each row of the first matrix is multiplied by each column of the second matrix, so
each column is transferred from memory only once and used in several multiplications.
BLAS defines both single precision and double precision floating-point operations.
We implemented the double precision matrix multiplication performed by the GEMM
function of BLAS [14]. Because of its wide use in software, GEMM is also the function
most frequently optimized for speed; on each computer system it is coded very
carefully to achieve the best performance. As a result, microprocessors achieve almost
90% of their peak computing power when performing matrix multiplication, whereas
sustained system performance rarely reaches 50%. This makes GEMM a very widely
used function. For hardware implementation, such widespread use of GEMM in
software is good news: it makes the implementation effort worthwhile.
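The ratio of computation to data transfer mentioned above can be made concrete with a small Python sketch. The operation and word counts are the usual textbook estimates for square operands of size n (each operand is assumed to be moved exactly once); the function name is hypothetical.

```python
def blas_intensity(n: int) -> dict:
    """Floating-point operations per word moved, one example per BLAS level."""
    return {
        # dot product: about 2n flops, read 2n words, write 1 word
        "level1_dot": (2 * n) / (2 * n + 1),
        # y = A*x: about 2n^2 flops, read n^2 + n words, write n words
        "level2_gemv": (2 * n * n) / (n * n + 2 * n),
        # C = A*B: about 2n^3 flops, read 2n^2 words, write n^2 words
        "level3_gemm": (2 * n ** 3) / (3 * n * n),
    }


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, blas_intensity(n))
```

For the vector-vector and matrix-vector cases the ratio stays close to 1 and 2 regardless of n, whereas for matrix-matrix multiplication it grows as 2n/3, which is why this is the operation that can keep many arithmetic units busy from a limited memory interface.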
The logic resources of the Virtex4LX200 located on the SGI RASC allow for the
implementation of 24 double precision Multiply and Accumulate (MAC) operators.
Besides general purpose logic, the Virtex4 FPGA family also offers blocks dedicated
to digital signal processing, the so-called DSP48 blocks, which allow for an efficient
MAC implementation. Consequently there are two types of MACs in the matrix
multiplication implementation: with and without DSP48 usage. Because a MAC
requires one multiplier, one adder and additional control logic, it is possible to fit
24 MACs in the Virtex4LX200 (6 of them based on DSP48 blocks). The resources used
for the MAC implementation are presented in Table 5. The designed architecture is
pipelined and can perform 24 MAC operations in each clock cycle. Considering that
the RASC clock is 200 MHz and that each MAC performs two floating-point operations,
we can achieve 9,6 GFLOPS of computing power. As a reference, the peak performance
of the Itanium2 at 1,5 GHz is 6 GFLOPS. It must be highlighted that the Itanium2
and the Virtex4 are produced in the same 90 nm semiconductor technology.
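How a bank of MAC operators shares the work of a matrix multiplication can be sketched in Python as follows. This is a simplified functional model, not the architecture of [14]: the assignment of one output element per MAC, the cycle count that ignores pipeline latency and memory stalls, and the function name are all assumptions made for illustration.

```python
def gemm_mac_array(a, b, num_macs=24):
    """Compute C = A*B with num_macs MAC units, returning (C, idealized cycle count)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    outputs = [(i, j) for i in range(n) for j in range(m)]
    cycles = 0
    # Each batch gives every MAC unit one output element to accumulate.
    for start in range(0, len(outputs), num_macs):
        batch = outputs[start:start + num_macs]
        for p in range(k):
            # In hardware all MACs of the batch perform this step in the same clock cycle.
            for i, j in batch:
                c[i][j] += a[i][p] * b[p][j]
            cycles += 1
    return c, cycles


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(gemm_mac_array(a, b, num_macs=2))
```

In this model the cycle count drops in proportion to the number of MACs as long as there are enough independent output elements to fill them, which is the parallelism the pipelined design exploits.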
In practice the computing power of the FPGA and the Itanium2 for double precision
matrix multiplication is the same. The number of MACs in the Virtex4 had to be
reduced: with almost 100% of the logic resources utilized we could not meet the
timing constraints required to run with the 200 MHz clock.
Table 5
Implementation results for GEMM function.
Matrix multiplier: 24 MACs, 128-bit data interface
  FPGA logic resources      50%
  FPGA DSP48 resources      100%
  Performance               4,8 GFLOPS
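The performance figures quoted in this section follow from simple arithmetic, sketched below in Python. Two assumptions are made that the text does not state explicitly: a MAC is counted as two floating-point operations per cycle, and the Itanium2 is taken to complete four floating-point operations per cycle (two fused multiply-adds), which reproduces the quoted 6 GFLOPS. The function names are hypothetical.

```python
def peak_gflops(units: int, flops_per_unit_per_cycle: int, clock_mhz: float) -> float:
    """Peak performance in GFLOPS for a number of identical arithmetic units."""
    return units * flops_per_unit_per_cycle * clock_mhz / 1000.0


def busy_macs(gflops: float, clock_mhz: float) -> float:
    """Number of fully busy MACs (2 flops per cycle each) matching a GFLOPS figure."""
    return gflops * 1000.0 / (2 * clock_mhz)


fpga_peak = peak_gflops(24, 2, 200)      # 24 MACs at 200 MHz
itanium_peak = peak_gflops(1, 4, 1500)   # assumed 4 flops per cycle at 1,5 GHz
table5_macs = busy_macs(4.8, 200)        # what the Table 5 figure would imply at 200 MHz

if __name__ == "__main__":
    print(fpga_peak, itanium_peak, table5_macs)
```

Under these assumptions the 4,8 GFLOPS in Table 5 corresponds to 12 fully busy MACs at 200 MHz, or equivalently to 24 MACs at half that clock rate; the paper does not say which of the two applies.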
It is worth noting that the implemented hardware performs double precision
floating-point multiplication. If single precision were considered, we could outperform
the Itanium2, which calculates at the same speed regardless of data precision. In the
FPGA we could implement more MACs because single precision requires fewer logic
resources. According to our evaluation, in the single precision case we could achieve
a speedup of 0,5. On the other hand, there are dedicated processors such as the
ClearSpeed CSX600 that offer 24 GFLOPS for double precision matrix multiplication.
We conclude that FPGA hardware acceleration is not always the best solution.
8. Conclusions
This article concerns reconfigurable high performance computing. Both the basic
principles and example applications were presented.
Under some circumstances, reconfigurable computing, that is custom-hardware-oriented
computing implemented in reconfigurable logic, can be an attractive choice if
higher computing power is necessary.
Reconfigurable custom hardware should be considered if no dedicated hardware
exists in hard-wired semiconductor technology. FPGAs are flexible in configuration,
but this feature makes them slower than non-reconfigurable hardware. The number
of resources (logic gates) in reconfigurable chips is also limited.
An FPGA-accelerated algorithm should not rely on data transfer. If a large amount
of data transfer is required relative to computation, the advantages of reconfigurable
logic are limited. Before a selected algorithm can be successfully accelerated by
custom hardware, data representation should also be carefully considered. One should
always use as few bits as possible for data representation to obtain a higher speedup.
Where possible, operations should also be performed in fixed-point rather than
floating-point arithmetic for better results in the FPGA.
Custom hardware speedup stems from pipelining and parallel execution, so
accelerated algorithms should lend themselves to being performed in such a manner.
There are also some kinds of operations for which GPPs are not effective at all: if
logical operations on single bits have to be executed, custom hardware is the best choice.
Accelerated software should have a well-defined computational kernel. In practice,
for the complete solution of a problem we usually rely on hybrid systems in which
the GPP is supported by an FPGA. In accelerated applications only a few lines of the
computational kernel are moved from software to hardware, and in practice only a
few lines converted to hardware fit within the FPGA capacity. The larger the share
of the calculations the kernel accounts for, the more successful the acceleration.
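The dependence of the overall gain on the weight of the kernel can be illustrated with Amdahl's law, a standard model that the paper does not name explicitly. The Python sketch below uses hypothetical parameter values and ignores the cost of moving data between the GPP and the FPGA.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: speedup of the whole application when only the kernel is accelerated.

    kernel_fraction: share of the original run time spent in the kernel (0..1).
    kernel_speedup:  how many times faster the kernel runs in hardware.
    """
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical example: the same 10x kernel speedup for different kernel weights.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(overall_speedup(fraction, 10.0), 2))
```

With a tenfold kernel speedup, a kernel that accounts for half of the run time gives an overall gain below 2, while one that accounts for 99% of the run time gives a gain above 9.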
Reconfigurable computing is also green, environmentally friendly computing. FPGA
chips consume an order of magnitude less power than GPPs. This allows for a reduction
in the electrical power needed for both supply and cooling.
References
[1] Bloom B.H.: Space/time trade-oﬀs in hash coding with allowable errors. Commun.
ACM, 13(7), pp. 422–426, 1970
[2] Dongarra J.J., Du Croz J., Hammarling S., Duﬀ I.S.: A set of level 3 basic linear
algebra subprograms. ACM Trans. Math. Softw., 16(1), pp. 1–17, 1990
[3] Doss C.C., Riley R.L. Jr.: FPGA-based implementation of a robust IEEE-754
exponential unit. In FCCM '04: Proceedings of the 12th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines, pp. 229–238, Washington,
DC, USA, 2004. IEEE Computer Society
[4] Faria D.B., Cheriton D.R.: DoS and authentication in wireless public access
networks. In WiSE '02: Proceedings of the 1st ACM workshop on wireless security,
pp. 47–56, New York, NY, USA, 2002. ACM
[5] Silicon Graphics: SGI® RASC™ RC100 Blade, dramatic application speed-up with
next generation reconfigurable compute technology. http://www.sgi.com
[6] Harris B., Jacob A.C., Lancaster J.M., Buhler J., Chamberlain R.D.: A banded
Smith-Waterman FPGA accelerator for Mercury BLASTP. International Conference on
Field Programmable Logic and Applications, FPL 2007, pp. 765–769, 27–29
Aug. 2007
[7] He C., Lu M., Sun C.: Accelerating seismic migration using FPGA-based coprocessor
platform. In FCCM '04: Proceedings of the 12th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pp. 207–216, Washington,
DC, USA, 2004. IEEE Computer Society
[8] Jamro E., Wiatr K., Wielgosz M.: FPGA implementation of 64-bit exponential
function for HPC. International Conference on Field Programmable Logic and
Applications, FPL 2007, pp. 718–721, 27–29 Aug. 2007
[9] Wielgosz M., Piteron M., Jamro E., Russek P., Wiatr K.: Two electron integrals
calculation accelerated with double precision exp() hardware module. Reconﬁg-
urable Systems Summer Institute, RSSI proceedings, July 2007
[10] Montgomery P.L.: Modular multiplication without trial division. Mathematics
of Computation, pp. 519–521, 1985
[11] Prasanna V.K.: Energy-efficient computations on FPGAs. J. Supercomput., 32(2),
pp. 139–162, 2005
[12] Federal Information Processing Standards: FIPS PUB 197, Advanced Encryption
Standard (AES), November 2001
[13] Rivest R.L., Shamir A., Adleman L.M.: A method for obtaining digital signatures
and public-key cryptosystems. Technical Report MIT/LCS/TM-82, 1977
[14] Wiatr K., Russek P.: Dedicated architecture for double precision matrix multiplica-
tion in supercomputing environment. IEEE Workshop on Design and Diagnostics
of Electronic Circuits and Systems, Cracow, April 2007
[15] Xilinx. Virtex-4 User Guide. http://www.xilinx.com, 2007
[16] Zhang P., Tan G., Gao G.R.: Implementation of the Smith-Waterman algorithm
on a reconfigurable supercomputing platform. In HPRCTA '07: Proceedings of
the 1st international workshop on High-performance reconﬁgurable computing
technology and applications, pp. 39–48, New York, NY, USA, 2007. ACM