Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with
  Multi-Precision Floating Point Support in 22 nm FD-SOI by Cavalcante, Matheus et al.
Matheus Cavalcante,∗ Fabian Schuiki,∗ Florian Zaruba,∗ Michael Schaffner,∗ Luca Benini,∗† Fellow, IEEE
Abstract—In this paper, we present Ara, a 64-bit vector
processor based on the version 0.5 draft of RISC-V’s vector
extension, implemented in GLOBALFOUNDRIES 22FDX FD-SOI
technology. Ara’s microarchitecture is scalable, as it is composed
of a set of identical lanes, each containing part of the processor’s
vector register file and functional units. It achieves up to 97%
FPU utilization when running a 256 × 256 double precision
matrix multiplication on sixteen lanes. Ara runs at 1.2 GHz in
the typical corner (TT/0.80 V/25 ◦C), achieving a performance up
to 34 DP−GFLOPS. In terms of energy efficiency, Ara achieves
up to 67 DP−GFLOPS/W under the same conditions, which is
56% higher than similar vector processors found in literature.
An analysis on several vectorizable linear algebra computation
kernels for a range of different matrix and vector sizes gives
insight into performance limitations and bottlenecks for vector
processors and outlines directions to maintain high energy effi-
ciency even for small matrix sizes where the vector architecture
achieves suboptimal utilization of the available FPUs.
Index Terms—Vector processor, SIMD, RISC-V.
I. INTRODUCTION
The end of Dennard scaling caused the race for performance through higher frequencies to halt more than a
decade ago, when increasing integration densities stopped
translating into proportionate increases in performance or
energy efficiency [1]. Processor frequencies plateaued, inciting
interest in parallel multi-core architectures. These architectures,
however, do not address the efficiency limitation created by
fetching and decoding elementary instructions, which only keep
the processor datapath busy for a very short time. Moreover,
power dissipation limits how many cores can be turned
on simultaneously. Coarse-grain multi-core architectures are
running out of steam, and core-count scaling via Moore’s law
has further slowed down [2].
Due to power limitations, modern systems must be ever more
energy efficient [3], [4]. In instruction-based programmable
architectures, the key challenge is how to mitigate the Von
Neumann bottleneck [5]. Despite the flexibility of multi-core
designs, they fail to exploit the regularity of data-parallel
applications. Each core tends to execute the same instructions
many times—a waste in terms of both area and energy [6]. By
reducing the area and energy involved in fetching and decoding
instructions and managing the instruction stream (control flow),
more area and energy can be used for the actual computation.
∗Integrated Systems Laboratory of ETH Zurich, Zurich, Switzerland.
†Department of Electrical, Electronic, and Information Engineering Guglielmo
Marconi of the University of Bologna, Bologna, Italy. E-mail: {matheusd,
fschuiki, zarubaf, mschaffner, lbenini} at iis.ee.ethz.ch.
The strong emergence of massively data-parallel workloads,
such as data analytics and machine learning [7], created a
major window of opportunity for architectures that, unlike multi-
core designs, effectively exploit data parallelism to achieve
energy efficiency. The most successful of these architectures
are General Purpose Graphics Processing Units (GPUs) [8],
which heavily leverage data-parallel multithreading to relax
the Von Neumann bottleneck through the so-called single
instruction, multiple thread (SIMT) approach [9]. GPUs are
now dominating the energy efficiency race, being present in
70% of the Green500 list [10]. They are also highly successful
as data-parallel accelerators in high-performance embedded
applications, such as self-driving cars [11].
The quest for extreme energy efficiency in data-parallel
execution has also renewed interest in vector architectures.
This kind of architecture was cutting edge at the time of another
technology scaling crisis, namely of Emitter-Coupled Logic
(ECL) based circuits [12]. Today, designers and architects are
reconsidering vector processing approaches, as they promise to
address the Von Neumann bottleneck very effectively, providing
better energy efficiency than a general-purpose processor
for applications that fit the vector processing model [6]. A
single vector instruction can be used to express a data-parallel
computation on a very large vector, thereby amortizing the
instruction fetch and decode overhead. The effect is even more
pronounced than for SIMT architectures, where instruction
fetches are only amortized over the number of parallel scalar
execution units in a “processing block”: for the most recent
NVIDIA Volta GPUs, such blocks are only 32 elements
long [13]. Thus, vector processors provide a notably effective
model to efficiently execute the data parallelism of scientific
and matrix-oriented computations [14], [15], such as digital
signal processing and machine learning algorithms.
arXiv:1906.00478v1 [cs.AR] 2 Jun 2019
The renewed interest in vector processing is reflected by
the introduction of vector instruction extensions in all popular
Instruction Set Architectures (ISAs), such as the proprietary
ARM ISA [16] and the open-source RISC-V ISA. In this paper,
we set out to analyze the scalability and energy efficiency of
vector processors by designing and implementing a RISC-V-based
architecture in an advanced Complementary Metal-Oxide-Semiconductor
(CMOS) technology. The design will
be open-sourced under a liberal license as part of the PULP
Platform (see https://pulp-platform.org/). The key contributions of this paper are:
1) The architecture of a parametric in-order high-performance
64-bit vector unit based on the version 0.5 draft of RISC-
V’s vector extension [17]. The vector processor was
designed for a memory bandwidth per peak performance
ratio of 2 B/DP−FLOP, and works in tandem with an open-
source application-class RV64GC scalar core. The unit
supports mixed-precision arithmetic with double, single,
and half-precision floating point operands.
2) Performance analysis on key data-parallel kernels, both
compute- and memory-bound, for variable problem sizes
and design parameters. The performance is shown to meet
the roofline achievable performance boundary, as long as
the vector length is a few times longer than the number
of physical lanes.
3) An architectural exploration and scalability analysis of
the vector processor with post-implementation results ex-
tracted from GLOBALFOUNDRIES 22FDX Fully Depleted
Silicon on Insulator (FD-SOI) technology.
4) Insights on performance limitations and bottlenecks, for
both the proposed architecture and for other vector pro-
cessors found in the literature.
This paper is organized as follows. In Section II we present
some background and related work with the architectural models
most commonly used to explore data parallelism. Then, in
Section III, we present the architecture of our vector processor.
Section IV presents the benchmarks we used to evaluate our
vector unit. Section V analyzes the performance of our vector unit on
High-Performance Computing (HPC) workloads,
while Section VI analyzes implementation results
in terms of power and energy efficiency. Finally, Section VII
concludes the paper and outlines future research directions.
II. BACKGROUND AND RELATED WORK
Flynn’s seminal papers [18] on a taxonomy for computer
organization discussed two models to explore data parallelism,
namely single instruction, multiple data (SIMD) and multiple
instruction, multiple data (MIMD). More recent architectures
follow SIMT and vector thread (VT) models, which can be
conceptually placed somewhere between SIMD and MIMD.
A. MIMD
Multi-core designs involve many processors, autonomously
executing instructions on various data. MIMD models are
highly flexible, allowing a direct vectorization of codes with
irregular control and data flow. This flexibility becomes a liability
when the model is applied to highly regular applications. In such
a case, each core will tend to run the very same instructions,
wasting energy by redundant fetch and decode operations [6].
B. SIMD
Unlike its MIMD counterpart, a SIMD architecture shares—
and thus amortizes—the instruction fetch logic among a
multitude of identical processing units. This architectural model
can also be seen as instructions operating on vectors of operands.
The SIMD approach works well as long as the control flow is
regular enough, i.e., it is possible to formulate the problem in
terms of vector operations. Flynn also highlights the difference
between array and vector processors [18].
1) Array processors: Array processors implement a packed-
SIMD architecture. This type of processor has several inde-
pendent but identical processing elements (PEs), all operating
on commands from a shared control unit. Figure 1 shows an
execution pattern for the dummy instruction sequence “ld–mul–
add–st.” The number of PEs determines the vector length, and
the architecture can be seen as a wide datapath encompassing
all subwords handled individually by each element [19].
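[Figure 1: processing elements PE0 to PE3 (rows) over time t (columns). In each time step, all PEs execute the same operation on their own element: ld0..ld3, then mul0..mul3, add0..add3, and st0..st3.]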
Fig. 1. Execution pattern on an array processor [18].
A limitation of such an architecture is that the vector length
is fixed. It is commonly encoded into the instruction itself,
meaning that each expansion of the vector length comes with
another ISA extension. For instance, Intel’s first version of
the Streaming SIMD Extensions (SSEs) operates on 128 bit
registers, whereas its successors, the Advanced Vector Extensions (AVX) and
AVX-512, operate on 256 and 512-bit wide registers,
respectively [20]. ARM provides packed-SIMD capability via
the “Neon” extension, operating on 128 bit wide registers [21].
RISC-V also supports packed-SIMD via DSP extensions [22].
2) Vector processors: Vector processors are time-multiplexed
versions of array processors, implementing vector-SIMD in-
structions. Several specialized functional units stream the micro-
operations on consecutive cycles, as shown in Figure 2. By
doing so, the number of functional units no longer constrains the
vector length, which can be dynamically configured. As opposed
to packed-SIMD, long vectors do not need to be subdivided
into fixed-size chunks, but can be issued using a single vector
instruction. Hence, vector processors are potentially more
energy efficient than an equivalent array processor since many
control signals can be kept constant throughout the computation,
and the instruction fetch cost is amortized among many cycles.
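[Figure 2: functional units LD, MUL, ALU, and ST (rows) over time t (columns). Each unit streams its micro-operations on consecutive cycles: ld0..ld3, mul0..mul3, add0..add3, and st0..st3.]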
Fig. 2. Execution pattern on a vector processor [18].
The history of vector processing starts with the traditional vec-
tor machines from the sixties and seventies, with the beginnings
of the Illiac IV project [14]. The trend continued throughout the
next two decades with some supercomputers, such as the Cray-1
in 1975 [12]. At the end of the century, however, microprocessor-
based systems approached or surpassed the performance of
vector supercomputers at much lower costs [23], due to intense
work on superscalar and Very Long Instruction Word (VLIW)
architectures. It is only recently that vector processors received
renewed interest from the scientific community.
ARM is moving into Cray-inspired processing with their
Scalable Vector Extension (SVE) [16]. The extension is based
on the vector register architecture introduced with the Cray-1,
leaving the vector length as an implementation choice (from
128 bit to 2048 bit, in 128 bit increments). It is possible to
write code agnostic to the vector length, so that different
implementations can run the same software. The first system to
adopt this extension is Fujitsu’s A64FX, at a peak performance
of 2.7 DP−TFLOPS in a 7 nm process, which is competitive
in terms of peak performance to leading-edge GPUs [24].
The open RISC-V ISA specification is also leading an effort
towards vector processing through its vector extension [17].
This extension is in active development, and, at the time of this
writing, its latest version was the 0.5 draft. When compared
with ARM SVE, RISC-V does not put any limits on the vector
length. Moreover, the extension makes it possible to trade
off the number of architectural vector registers against longer
vectors. Due to the availability of open-source RISC-V scalar
cores, together with the liberal license of the ISA itself, we
chose to design our vector processor based on this extension.
C. SIMT
Coming from the GPU domain, SIMT architectures rep-
resent an amalgamation of the flexibility of MIMD and the
efficiency of SIMD designs. While SIMD architectures apply
one instruction to multiple data lanes, SIMT designs apply one
instruction to multiple independent threads in parallel [9]. The
NVIDIA Volta GV100 GPU is a state-of-the-art example of
this architecture, with 64 “processing blocks,” called Streaming
Multiprocessors (SMs) by NVIDIA, each handling 32 threads.
A SIMD instruction exposes the vector length to the
programmer and requires manual branching control, usually
by setting flags that indicate which lanes are active for a
given vector instruction. SIMT designs, on the other hand,
allow the threads to diverge, although substantial performance
improvement can be achieved if they remain synchronized [9].
SIMD and SIMT designs also handle data accesses differently.
While SIMD designs have instructions that specify data accesses
from a contiguous region in memory, each SIMT thread
executes scalar data accesses. Modern SIMT architectures
dynamically coalesce these accesses into large contiguous
chunks to better utilize the memory subsystem. While this
simplifies the programming model to some extent, it also incurs
considerable hardware complexity and energy overhead [25].
D. Vector thread
Another compromise between SIMD and MIMD is the VT architecture
[25]. Similar to SIMT designs, and unlike SIMD,
VT architectures leverage the threading concept instead of the
more rigid notion of lanes, and hence provide a mechanism
to handle program divergence. The main difference between
SIMT and VT is that in the latter the vector instructions reside
in another thread, and scalar bookkeeping instructions can
potentially run concurrently with the vector ones. This division
alleviates the problem of SIMT threads running redundant scalar
instructions that must be later coalesced in hardware. Hwacha,
based on a custom RISC-V extension, is an example of a VT
design. A recent instance of Hwacha achieves 64 DP−GFLOPS
in ST 28 nm FD-SOI technology [26].
Many vector architectures report only full-system performance
and efficiency metrics, which also account for the memory hierarchy
or main memory controllers. This is the case of Fujitsu’s A64FX [24].
As our focus is on the core execution engine, we will mainly
compare our vector unit with Hwacha in Section VI-B. Hwacha
is an open-source design for which information
about the internal organization is available, allowing for a fair
quantitative comparison on a single processing engine.
III. ARCHITECTURE
In this section, we introduce the microarchitecture of Ara, a
scalable high-performance vector unit based on RISC-V’s vector
extension. As illustrated in Figure 3a, Ara works in tandem
with Ariane [27], an open-source Linux-capable application-
class core. To this end, Ariane has been extended to drive the
accompanying vector unit as a tightly coupled coprocessor.
A. Ariane
Ariane is an open-source, in-order, single-issue, 64-bit
application-class processor implementing RV64GC [27]. It
has support for hardware multiply/divide and atomic memory
operations, as well as an IEEE-compliant FPU [28]. It has been
manufactured in GLOBALFOUNDRIES 22FDX technology,
running at most at 1.7 GHz and achieving an energy efficiency
of up to 40 GOPS/W. Zaruba and Benini [27] report that the
core has a six-stage pipeline, namely Program Counter (PC)
Generation, Instruction Fetch, Instruction Decode, Issue Stage,
Execute Stage, and Commit Stage. We denote the first two
stages as Ariane’s front end, responsible for the instruction
fetch interface, and the remaining four as its back end.
Ariane needs some architectural changes to drive our vector
unit, all of them in the back end. Vector instructions are decoded
partially in Ariane’s Instruction Decoder, to recognize whether
they are vector instructions, and then completely in a dedicated
Vector Instruction Decoder inside Ara. The reason for this
split decoding is the high number of Vector Control and Status
Registers—one for each of the 32 vector registers—that are
taken into account before fully decoding such instructions.
The dispatcher controls the interface between Ara and
Ariane’s dedicated scoreboard port. Unlike Ariane, Ara executes
instructions non-speculatively. The dispatcher integrates the
speculative and non-speculative regimes by monitoring the
instructions at the top of the Re-order Buffer (ROB). When a
vector instruction reaches the top of the ROB (i.e., it is no longer
speculative), the dispatcher pushes it into the instruction queue,
together with the contents of any read registers. Ara reads from
this queue, and then either acknowledges the instruction or
propagates potential exceptions back to Ariane’s scoreboard.
[Figure 3a: block diagram. Labels recoverable from the figure: memory interconnect and data width converter (widths W and 64); Ariane (RV64GC) with PC Gen, IF, ID, Issue, EX, and Commit stages, scoreboard, register file read/write, CSR write, I$, D$, FPU, multiplier, CSR buffer, ALU, and LSU; Ara (RV64V) with its front end (decoder, dispatcher), sequencer, lanes 0 to N-1, SLDU, and VLSU (address generator, load unit, store unit); operation, ack, and scalar result signals.]
(a) Block diagram of an Ara instance with N parallel lanes. Ara receives
its commands from Ariane, an RV64GC scalar core. The vector unit has
a main sequencer; N parallel lanes; a Slide Unit (SLDU); and a Vector
Load/Store Unit (VLSU). The memory interface is W bit wide.
[Figure 3b: block diagram of one lane. Labels recoverable from the figure: lane sequencer, VRF arbiter, eight 1RW SRAM banks (Bank 0 to Bank 7), operand queues, ALU, FPU, MUL, VLSU and SLDU operand ports, operand requests, and operation, ack, and scalar result signals.]
(b) Block diagram of one lane of Ara. It contains a lane sequencer (handling
up to 8 vector instructions); a 16 KiB vector register file; an arbiter to
orchestrate access to the register file; ten operand queues; an integer
Arithmetic Logic Unit (ALU); an integer multiplier (MUL); and a Floating
Point Unit (FPU).
Fig. 3. Top-level block diagram of Ara.
Instructions are acknowledged as soon as Ara determines
that they will not throw any exceptions. This happens early in
their execution, usually after their decoding. Because vector
instructions can run for an extended number of cycles (as
presented in Figure 2), they may get acknowledged many cycles
before the end of their execution, potentially freeing the scalar
core to continue execution of its instruction stream.
B. Sequencer
The sequencer is responsible for keeping track of the vector
instructions running on Ara, dispatching them to the different
execution units and acknowledging them with Ariane. This
unit is the single block that has a global view of the instruction
execution progress across all lanes. Therefore, hazards among
pending instructions are resolved inside this block.
Structural hazards arise due to architectural decisions (e.g.,
shared paths between the ALU and the SLDU). They also arise
if a functional unit is not able to accept yet another instruction
due to the limited capacity of its operation queue. The sequencer
delays the issue of vector instructions until the structural hazard
has been resolved (i.e., the offending instruction completes).
The sequencer also stores information about which vector
instruction is accessing which vector register. This information
is used to determine data hazards between instructions. For
example, if a vector instruction tries to write to a vector
register that is already being written, the sequencer will flag the
existence of a write-after-write (WAW) data hazard between
them. Read-after-write (RAW), write-after-read (WAR) and
WAW hazards are handled in the same manner. Unlike structural
hazards, data hazards do not need to stall the sequencer, as
they are handled on a per-element basis downstream.
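The hazard classification described above can be illustrated with a minimal Python sketch. It only compares the register indices of two in-flight instructions; the `VectorInstr` record and the mnemonics in the example are hypothetical and do not reflect Ara's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class VectorInstr:
    """Hypothetical record of the vector registers one instruction touches."""
    name: str
    dest: Optional[int]            # destination vector register index, if any
    srcs: Tuple[int, ...] = ()     # source vector register indices


def data_hazards(older: VectorInstr, newer: VectorInstr) -> Set[str]:
    """Classify the data hazards that `newer` has with respect to `older`."""
    hazards = set()
    if older.dest is not None and older.dest in newer.srcs:
        hazards.add("RAW")   # newer reads a register that older writes
    if newer.dest is not None and newer.dest in older.srcs:
        hazards.add("WAR")   # newer overwrites a register that older still reads
    if older.dest is not None and older.dest == newer.dest:
        hazards.add("WAW")   # both instructions write the same register
    return hazards


# Illustrative instruction stream (mnemonics are placeholders).
vld = VectorInstr("vld v1", dest=1)
vmul = VectorInstr("vmul v2, v1, v3", dest=2, srcs=(1, 3))
vadd = VectorInstr("vadd v1, v4, v5", dest=1, srcs=(4, 5))

print(data_hazards(vld, vmul))   # {'RAW'}
print(data_hazards(vmul, vadd))  # {'WAR'}
print(data_hazards(vld, vadd))   # {'WAW'}
```

In this sketch a detected hazard is only recorded, which mirrors the statement that data hazards do not stall the sequencer and are instead resolved per element downstream.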
C. Slide unit
The SLDU is responsible for handling instructions that must
access all Vector Register File (VRF) banks at once. It handles,
for example, the insertion of an element into a vector, the
extraction of an element from a vector, vector shuffles, and
vector slides (vd[i] ← vs[i + slide amount]). This unit may also
be extended to support basic vector reductions, such as vector-add
and inner product. Support for vector reductions is
considered an optional feature in the current version of RISC-V’s
vector extension [17]. For simplicity, we decided not to
support them, taking into consideration that an O(n) vector
reduction can still be implemented as a sequence of O(log n)
vector slides and the corresponding arithmetic instruction [23].
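As an illustration of the slide-based reduction, the following Python sketch sums a vector with a logarithmic number of slide and add steps. It assumes that elements shifted in past the end of the vector are zero; the function names are hypothetical, and the sketch models the data movement only, not Ara's timing.

```python
def slide_down(v, amount):
    """vd[i] <- vs[i + amount]; positions past the end are filled with zero (assumption)."""
    n = len(v)
    return [v[i + amount] if i + amount < n else 0 for i in range(n)]


def reduce_sum(v):
    """Sum a non-empty vector using ceil(log2(n)) slide + add pairs.

    Returns (sum, number_of_slide_add_steps); the sum ends up in element 0.
    """
    acc = list(v)
    step = 1
    steps = 0
    while step < len(acc):
        # One vector slide followed by one element-wise vector add.
        acc = [a + b for a, b in zip(acc, slide_down(acc, step))]
        step *= 2
        steps += 1
    return acc[0], steps


print(reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3)
```

For eight elements, three slide and add pairs are enough, whereas a scalar loop would need seven dependent additions.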
D. Vector load/store unit
Ara has a single memory port, whose width is chosen to keep
the memory bandwidth per peak performance ratio fixed at
2 B/DP−FLOP. As illustrated in Figure 3a, Ara has an address
generator, responsible for determining which memory address
will be accessed. Memory operations fall into three classes: i) unit-strided loads and
stores, which access a contiguous chunk of memory; ii) constant-
strided memory operations, which access memory addresses
spaced with a fixed offset; and iii) scatters and gathers, which
use a vector of offsets to allow general access patterns. After
address generation, the unit coalesces unit-strided memory
operations into burst requests, avoiding the need to request the
individual elements from memory. The burst start address and
the burst length are then sent to either the load or the store
unit, both of which are responsible for initiating data transfers
through Ara’s Advanced eXtensible Interface (AXI) interface.
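The three access patterns and the coalescing of unit-strided accesses can be sketched as follows. This is a simplified model with hypothetical function names: it works on byte addresses, assumes 8-byte elements by default, and ignores protocol limits on burst length and alignment.

```python
def unit_stride(base, n, elem_bytes=8):
    """Addresses of n contiguous elements starting at base."""
    return [base + i * elem_bytes for i in range(n)]


def constant_stride(base, n, stride_bytes):
    """Addresses of n elements spaced by a fixed byte offset."""
    return [base + i * stride_bytes for i in range(n)]


def indexed(base, offsets):
    """Scatter/gather addresses: one byte offset per element."""
    return [base + off for off in offsets]


def coalesce_unit_stride(base, n, elem_bytes=8):
    """Describe a unit-strided access as one burst: (start address, length in bytes)."""
    return (base, n * elem_bytes)


print(unit_stride(0x1000, 4))            # [4096, 4104, 4112, 4120]
print(constant_stride(0x1000, 4, 64))    # [4096, 4160, 4224, 4288]
print(indexed(0x1000, [0, 24, 8, 40]))   # [4096, 4120, 4104, 4136]
print(coalesce_unit_stride(0x1000, 4))   # (4096, 32)
```

Only the unit-strided case collapses into a single start address and length; the other two patterns still produce one address per element in this model.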
E. Lane organization
Ara can be configured with a variable number of identical
lanes, each one with the architecture shown in Figure 3b. Each
lane has its own lane sequencer, responsible for keeping track
of up to eight parallel vector instructions. Each lane also has
a VRF and an accompanying arbiter to orchestrate its access,
operand queues, an integer ALU, an integer MUL, and an FPU.
Each lane contains part of Ara’s whole VRF and execution
units. Hence, most of the computation is contained within
one lane, and instructions that need to access all the VRF
banks at once (e.g., instructions that execute at the VLSU or
at the SLDU) use data interfaces between the lanes and the
responsible computing units. Each lane also has a command
interface attached to the main sequencer, through which the
lanes indicate they finished the execution of an instruction.
1) Lane sequencer: The lane sequencer is responsible for
issuing vector instructions to the functional units, controlling
their execution in the context of a single lane. Unlike the main
sequencer, the lane sequencers do not store the state of the
running instructions, avoiding data duplication across lanes.
They also initiate requests to read operands from the VRF. We
generate up to ten independent requests to the VRF arbiter.
Operand fetch and result write-back are decoupled from each
other. Starvation is avoided via a self-regulated process, through
back pressure due to unavailable operands. By throttling the
operation request rate, the lane sequencer indirectly limits the
rate at which results are produced. This is used to handle data
hazards, by ensuring that dependent instructions run at the same
pace: if instruction i depends on instruction j, the operands
of instruction i are requested only if instruction j produced
results in the previous cycle. There is no forwarding logic.
2) Vector register file: The VRF is at the core of every vector
processor. Because several instructions can run in parallel, the
register file must be able to support enough throughput to
supply the functional units with operands and absorb their
results. In RISC-V’s vector extension, the predicated multiply-
add instruction is the worst case regarding throughput, reading
four operands to produce one result.
Due to the massive area and power overhead of multi-ported
memory cuts, which usually require custom transistor-level
design, we opted not to use a monolithic VRF with several ports.
Instead, Ara’s vector register file is composed of a set of single-
ported (1RW) banks. The width of each bank is constrained to
the datapath width of each lane, i.e., 64 bit, to avoid subword
selection logic. Therefore, in steady state, five banks are
accessed simultaneously to sustain maximum throughput for
the predicated multiply-add instruction. Ara’s register file has
eight banks per lane, providing some margin on the banking
factor. This VRF structure (eight 64-bit wide 1RW banks) is
replicated at each lane, and all inter-lane communication is
concentrated at the VLSU and SLDU.
A multi-banked VRF raises the problem of banking conflicts,
which occur when several functional units need to access the
same bank. These are resolved dynamically with a weighted
round-robin arbiter per bank with two priority levels. Low-
throughput instructions, such as memory operations, are as-
signed a lower priority. By doing so, their irregular access
pattern does not disturb other concurrent high-throughput
instructions (e.g., floating-point instructions).
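A simplified Python sketch of a per-bank arbiter with two priority levels is given below. It is an assumption-laden model, not Ara's arbiter: it applies strict priority between the two levels and plain round-robin within a level, and it does not model the weights mentioned above. The class and parameter names are hypothetical.

```python
class BankArbiter:
    """Round-robin arbiter for one single-ported bank with two priority levels."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = -1  # index of the most recently granted requester

    def grant(self, requests, high_priority):
        """Pick one requester for this cycle.

        requests, high_priority: lists of booleans, one entry per requester.
        Returns the granted requester index, or None if nobody requests.
        """
        for level in (True, False):            # serve high-priority requesters first
            for k in range(1, self.n + 1):     # rotate, starting after the last winner
                idx = (self.last + k) % self.n
                if requests[idx] and high_priority[idx] == level:
                    self.last = idx
                    return idx
        return None


# Requesters 0 and 2 are high priority (e.g., arithmetic units), 1 is low (e.g., memory).
arb = BankArbiter(3)
prio = [True, False, True]
print([arb.grant([True, True, True], prio) for _ in range(3)])  # [0, 2, 0]
print(arb.grant([False, True, False], prio))                    # 1
```

In this strict-priority model the low-priority requester is served only when no high-priority request is pending, which shows why irregular memory accesses do not disturb concurrent high-throughput instructions.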
Figure 4b shows how the vector registers are mapped onto
the banks. The initial bank of each vector register is shifted in
a “barber’s pole” fashion. This avoids initial banking conflicts
when the functional units try to fetch the first element of
different vector registers, which are all mapped onto the same
bank in a pure element-partitioned approach [23] of Figure 4a.
[Figure 4a: mapping of the elements of vector registers v0 to v3 onto banks 0 to 7; every register places its elements 0 to 7 and 8 to 15 in banks 0 to 7, so all registers start at the same bank.]
(a) Without “barber’s pole” shift.
[Figure 4b: same mapping for v0 to v3 over banks 0 to 7, with the starting bank shifted for each successive vector register.]
(b) With “barber’s pole” shift.
Fig. 4. VRF organization inside one lane. Darker colors highlight the initial
element of each vector register vi . In a), all vector registers start at the same
bank. In b), the vector registers follow a “barber’s pole” pattern, the starting
bank being shifted for every vector register.
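The effect of the shifted mapping can be sketched in Python. The sketch assumes that consecutive elements of a register go to consecutive banks and that the starting bank advances by exactly one bank per vector register; the shift amount is an assumption made for illustration, and the function names are hypothetical.

```python
N_BANKS = 8  # banks per lane, as described in the text


def bank_plain(vreg, elem):
    """Pure element-partitioned mapping: every register starts at bank 0."""
    return elem % N_BANKS


def bank_barber_pole(vreg, elem):
    """Shifted mapping: register vreg starts at bank vreg % N_BANKS (assumed shift of one)."""
    return (elem + vreg) % N_BANKS


# Bank holding element 0 of v0..v3 under both mappings.
print([bank_plain(v, 0) for v in range(4)])        # [0, 0, 0, 0]
print([bank_barber_pole(v, 0) for v in range(4)])  # [0, 1, 2, 3]
```

With the plain mapping, fetching the first element of several registers hits a single bank; with the shifted mapping, the same requests land in distinct banks.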
3) Operand queues: The multi-banked organization of the
VRF can lead to banking conflicts when several functional units
try to access operands in the same bank. Each lane has a set of
operand queues between the VRF and the functional units to
absorb such banking conflicts. There are ten operand queues:
four of them are dedicated to the FPU/MUL unit, three of them
to the ALU (two of which are shared with the SLDU), and
another three to the VLSU. Each queue is 64 bit wide and their
depth was chosen via simulation. The queue depth depends
on the functional unit’s latency and throughput, so that low-throughput
functional units, such as the VLSU, require shallower
queues than the FPUs. Queues between the functional units’
output ports and the vector register file absorb banking conflicts
on the write-back path to the VRF. Each lane has two such
queues, one for the FPU/MUL and one for the ALU.
4) Execution units: Each lane has three execution units, an
integer ALU, an integer MUL, and an FPU, all of them operating
on a 64-bit datapath. The MUL shares the operand queues with
the FPU, and they cannot be used simultaneously. With the
exception of this constraint, vector chaining is allowed between
any execution units, as long as they are executing instructions
with regular access patterns (i.e., no vector shuffles).
It is possible to subdivide the 64-bit datapath, trading
narrower data formats for a corresponding increase in performance.
The three execution units have a 64 bit/cycle throughput,
regardless of the data format of the computation. We developed
our multi-precision ALU and MUL, both producing 1 × 64,
2 × 32, 4 × 16, and 8 × 8 bit signed or unsigned operands. Ara
has limited support for multi-precision operations, allowing for
data promotions from 8 to 16, 16 to 32, and from 32 to 64 bit.
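The subdivision of the 64-bit datapath can be illustrated with a small Python model of a packed addition. It treats a 64-bit word as 64/width independent unsigned elements and wraps each element on overflow; this is a behavioral sketch with hypothetical names, not a description of Ara's ALU implementation.

```python
def packed_add(a, b, width):
    """Add two 64-bit words as 64 // width independent unsigned elements of `width` bits."""
    if width not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    mask = (1 << width) - 1
    result = 0
    for k in range(64 // width):
        shift = k * width
        # Each element is added separately, so carries never cross element boundaries.
        s = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (s & mask) << shift
    return result


print(hex(packed_add(0xFFFF, 0x0001, 16)))  # 0x0 (the 16-bit element wraps around)
print(hex(packed_add(0xFFFF, 0x0001, 32)))  # 0x10000 (the carry stays inside the 32-bit element)
```

The same 64-bit word therefore yields one, two, four, or eight results per cycle depending on the selected element width.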
For the FPU, we used an open-source, IEEE-compliant, multi-
precision FPU developed by Mach et al. [28]. The FPU was
configured to support Fused Multiply-Adds (FMAs), additions,
multiplications, divisions, square roots, and comparisons. Like
the integer units, the FPU has a 64 bit/cycle throughput, i.e.,
one double precision, two single precision or four IEEE
754 half-precision floating point results per cycle. Besides
IEEE 754 standard floating point formats, the FPU also supports
alternative formats, 8 and 16-bit wide. Depending on the
application, narrower formats can be used to achieve significant
energy savings compared to a wide floating-point baseline [28].
IV. BENCHMARKS
In this section, we describe the benchmarks used to evaluate
the performance of the vector unit in memory- and compute-
bound regimes.
A. Arithmetic intensity
Memory bandwidth is often a limiting factor when it comes
to processor performance, and many optimizations revolve
around scheduling memory and arithmetic operations with the
purpose of hiding memory latency. The relationship between
processor performance and memory bandwidth can be analyzed
with the roofline model [29]. This model shows the peak
achievable performance (in OP/cycle) as a function of the
arithmetic intensity I, defined as the algorithm-dependent ratio
of operations per byte of memory traffic.
According to this model, computations can be either
memory- or compute-bound [30], with the peak performance be-
ing achievable only when the algorithm’s arithmetic intensity, in
operations per byte, is higher than the processor’s performance
per memory bandwidth ratio. As detailed in Section III-D, we
configured Ara so that it is in its compute-bound regime when
the arithmetic intensity is higher than 0.5 DP−FLOP/B. The
memory bandwidth determines the slope of the performance
boundary in the memory-bound regime, where the attainable performance equals βI, with β denoting the memory bandwidth. We
consider three benchmarks to explore the architecture instances
of the vector processor with distinct arithmetic intensities that
fully span the two regions of the roofline.
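The roofline bound can be written down in a few lines of Python. The sketch below assumes that each lane delivers one double-precision FMA, counted as 2 DP-FLOP, per cycle, and that the memory bandwidth is 2 B per DP-FLOP of peak performance, as stated for the memory port; the function names are hypothetical.

```python
def roofline(intensity, peak, bandwidth):
    """Attainable performance in OP/cycle: min(peak, bandwidth * intensity)."""
    return min(peak, bandwidth * intensity)


def ara_like_roofline(intensity, lanes):
    """Roofline of an Ara-like instance under the stated assumptions."""
    peak = 2 * lanes        # DP-FLOP/cycle, assuming one FMA per lane per cycle
    bandwidth = 2 * peak    # B/cycle, from the 2 B/DP-FLOP ratio
    return roofline(intensity, peak, bandwidth)


# The ridge point sits at an intensity of 0.5 DP-FLOP/B regardless of the lane count.
print(ara_like_roofline(0.5, lanes=4))    # 8.0, the peak of a 4-lane instance
print(ara_like_roofline(0.25, lanes=4))   # 4.0, memory-bound at half the peak
print(ara_like_roofline(34.9, lanes=16))  # 32, compute-bound
```

Because the bandwidth scales with the peak performance in this model, the ridge point stays at 0.5 DP-FLOP/B for every lane count.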
Our first algorithm is MATMUL, an n × n matrix multiplication
C ← AB + C. Considering double-precision matrices, the
algorithm requires at least $32n^2$ bytes of memory transfers to
load matrices A, B, and C and store C back into memory. The
$2n^3$ operations (one FMA is considered as two operations)
imply that MATMUL has an arithmetic intensity of at least
$$I_{\mathrm{MATMUL}} \geq \frac{2n^3}{32n^2} = \frac{n}{16}\ \text{DP-FLOP/B}. \tag{1}$$
We will consider matrices of size at least 16 × 16 across several
Ara instances. The roofline model indicates that it is possible to
achieve the system’s peak performance with these matrix sizes.
The matrix multiplication is an interesting kernel to study,
neither embarrassingly memory-bound nor compute-bound,
since its arithmetic intensity grows with O(n). Nevertheless, it
is interesting to see how Ara behaves at such extreme cases.
DAXPY, Y ← αX +Y , is a common algorithmic building block
of more complex Basic Linear Algebra Subprograms (BLAS)
routines. Considering vectors of length n, the algorithm requires
the load of vectors X and Y and the write-back of vector Y,
a total of 24n bytes of memory transfers. The n FMAs of
DAXPY imply an arithmetic intensity of 1/12 DP−FLOP/B,
characterizing a heavily memory-bound algorithm.
We explore the extremely compute-bound spectrum with
the tensor convolution DCONV, a routine which is at the core
of convolutional networks. In terms of size, we took the first
layer of GoogLeNet [31], with a 64 × 3 × 7 × 7 kernel and
3 × 112 × 112 input images. Each point of the input image
must be convolved with the weights, resulting in a total
of 64 × 3 × 7 × 7 × 112 × 112 FMAs, or 236 DP−MFLOP. In
terms of memory, we will consider that the input matrix
(after padding) is loaded exactly once, or 3 × 118 × 118 double
precision loads, together with the write-back of the result, or
64 × 112 × 112 double precision stores. The 6.44 MiB of mem-
ory transfers imply an arithmetic intensity of 34.9 DP−FLOP/B,
making this kernel heavily compute-bound on Ara.
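The arithmetic intensities of the three benchmarks follow directly from the operation and memory-traffic counts given above. The short Python sketch below redoes that arithmetic; the function names are hypothetical, and the only inputs are the sizes stated in the text.

```python
BYTES_PER_DOUBLE = 8


def matmul_intensity(n):
    """n x n MATMUL: 2n^3 FLOP over 32n^2 bytes of traffic."""
    flops = 2 * n ** 3                          # one FMA counts as two operations
    traffic = 4 * n * n * BYTES_PER_DOUBLE      # load A, B, C and store C
    return flops / traffic


def daxpy_intensity(n):
    """DAXPY on vectors of length n: n FMAs over 24n bytes of traffic."""
    flops = 2 * n
    traffic = 3 * n * BYTES_PER_DOUBLE          # load X, load Y, store Y
    return flops / traffic


def dconv_figures():
    """First GoogLeNet layer: returns (FLOP, bytes, intensity)."""
    fmas = 64 * 3 * 7 * 7 * 112 * 112
    flops = 2 * fmas
    loads = 3 * 118 * 118                       # padded input, loaded once
    stores = 64 * 112 * 112                     # write-back of the result
    traffic = (loads + stores) * BYTES_PER_DOUBLE
    return flops, traffic, flops / traffic


if __name__ == "__main__":
    print(matmul_intensity(16), daxpy_intensity(256))
    flops, traffic, intensity = dconv_figures()
    print(flops / 1e6, traffic / 2 ** 20, intensity)
```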
B. Implementation and execution of a matrix multiplication
We will analyze in depth the implementation and execution
of the n × n matrix multiplication. Our implementation uses a
tiled approach working on b rows of matrix C at a time. For
simplicity, we consider that n is less than or equal to the
machine’s maximum vector length, so that one row of C fits in
one vector register, and that b divides n. We assume the matrices are stored in
row-major order. Figure 5 presents the matrix multiplication
algorithm, working on tiles of size b × n. There are three
distinct phases of the computation: I) read a block of matrix
C; II) the actual computation of the matrix multiplication; and
III) write the block of matrix C. Phases I and III take O(n)
cycles, whereas phase II takes O(n²) cycles.
The core part of Figure 5 is the for loop of line 8, where most
of the time is spent and where the FPUs are used. Listing 1
shows the resulting RISC-V vector assembly code for phase II
of the matrix multiplication, considering a block size of four
rows. We ignore some control flow instructions at the start and
end of Listing 1, which handle the outer for loop.
 1: r ← 0
 2: while r ≠ n do
 3:   for j ← 0 to b − 1 do {Phase I}
 4:     Load row r + j of matrix C into vector register vC_j;
 5:   end for
 6:   for i ← 0 to n − 1 do {Phase II}
 7:     Load row i of matrix B into vector register vB;
 8:     for j ← 0 to b − 1 do
 9:       Load element A[j, i];
10:       Broadcast A[j, i] into vector register vA;
11:       vC_j ← vA vB + vC_j;
12:     end for
13:   end for
14:   for j ← 0 to b − 1 do {Phase III}
15:     Store vector register vC_j into row r + j of matrix C;
16:   end for
17:   r ← r + b
18: end while
Fig. 5. Algorithm for the matrix multiplication C ← AB + C.
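The three phases of Figure 5 can be mirrored in a functional Python sketch, with Python lists standing in for vector registers. This is not the authors' C implementation; the function name is hypothetical, and the element of A is indexed as A[r + j][i], which is our reading of "A[j, i]" relative to the current block of rows.

```python
def matmul_tiled(A, B, C, b):
    """Return A*B + C, computed b rows of C at a time (structure of Fig. 5)."""
    n = len(A)
    assert n % b == 0, "the block size b must divide n"
    result = [row[:] for row in C]
    r = 0
    while r != n:
        # Phase I: load b rows of C into the "vector registers" vC[0..b-1].
        vC = [result[r + j][:] for j in range(b)]
        # Phase II: accumulate the contribution of one row of B at a time.
        for i in range(n):
            vB = B[i]
            for j in range(b):
                a = A[r + j][i]  # scalar element, broadcast over the whole row
                vC[j] = [a * vb + vc for vb, vc in zip(vB, vC[j])]
        # Phase III: store the block of C back.
        for j in range(b):
            result[r + j] = vC[j]
        r += b
    return result


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[1, 0], [0, 1]]
    print(matmul_tiled(A, B, C, b=2))
```

Each update of vC[j] corresponds to one vector multiply-add, so phase II issues n · b such operations per block, while phases I and III only touch b rows each.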
Listing 1
EXCERPT OF THE PHASE II OF THE MATRIX MULTIPLICATION
IMPLEMENTATION IN RISC-V VECTOR EXTENSION ASSEMBLY,
WITH A BLOCK SIZE OF FOUR ROWS OF MATRIX C .
1 ; a0: pointer to A a1: pointer to B
2 ; a2: A row size a3: B row size
3
4 vld vB0, 0(a1) ; load row of B
5 add a1, a1, a3 ; bump B pointer
6
7 vld vB1, 0(a1) ; load row of B
8 add a1, a1, a3 ; bump B pointer
9 ld t0, 0(a0) ; / load element of A
10 add a0, a0, a2 ; | bump A pointer
11 vins vA, t0, zero ; | move from Ariane to Ara
12 vmadd vC0, vA, vB0, vC0 ; \ vector multiply-add
13 ld t0, 0(a0)
14 add a0, a0, a2
15 vins vA, t0, zero
16 vmadd vC1, vA, vB0, vC1
17 ...
18 vins vA, t0, zero
19 vmadd vC3, vA, vB0, vC3
20
21 vld vB0, 0(a1) ; load row of B
22 add a1, a1, a3 ; bump B pointer
23 ld t0, 0(a0) ; / load element of A
24 add a0, a0, a2 ; | bump A pointer
25 vins vA, t0, zero ; | move from Ariane to Ara
26 vmadd vC0, vA, vB1, vC0 ; \ vector multiply-add
27 ld t0, 0(a0)
28 add a0, a0, a2
29 vins vA, t0, zero
30 vmadd vC1, vA, vB1, vC1
31 ...
32 vins vA, t0, zero
33 vmadd vC3, vA, vB1, vC3
After loading one row of matrix B, the kernel consists of
four repeating instructions, responsible for, respectively: i) loading
the element A[j, i] into a general-purpose register t0; ii) bumping
the address of A[j, i] in preparation for the next iteration; iii) broadcasting
scalar register t0 into vector register vA; iv) the multiply-add instruction
vC_j ← vA vB + vC_j. As Ariane is a single-issue core, this kernel
runs in at least four cycles. In steady state, however, we measure
that each loop iteration runs in five cycles. The reason for this,
as shown in the pipeline diagram of Figure 6, is one bubble
due to the data dependence between the scalar load (which
takes two cycles) and the broadcast instruction.
Instruction   Cycle
              1    2    3    4    5    6    7    8
LD            IS   EX   EX   CO
ADD                IS   EX   CO
VINS                    —    IS   EX   EX   CO
VMADD                             IS   EX   EX   CO
LD                                     IS   EX   EX
Fig. 6. Pipeline diagram of the matrix multiplication kernel. Only three
pipeline stages are highlighted: IS is Instruction Issue, EX is Execution Stage,
CO is Commit Stage. Ariane has two commit ports into the scoreboard.
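The five-cycle steady state can be reproduced with a very small issue model. The sketch below is a deliberate simplification and not a model of Ariane's actual scoreboard: it assumes a single-issue in-order core that stalls only on scalar source registers, that a result becomes usable by an instruction issued one cycle after the producer's last EX cycle, and that vector instructions never stall the scalar issue stage. All names are hypothetical.

```python
def issue_cycles(program, iterations):
    """Issue cycle of every instruction on a single-issue in-order core.

    program is a list of (name, scalar_sources, scalar_destination, latency)
    tuples; latency is the number of EX cycles of the instruction.
    """
    ready = {}   # scalar register -> first cycle in which a consumer may issue
    cycle = 0
    trace = []
    for _ in range(iterations):
        for name, sources, dest, latency in program:
            cycle += 1                                  # at most one issue per cycle
            for src in sources:
                cycle = max(cycle, ready.get(src, 0))   # stall on scalar operands
            trace.append((name, cycle))
            if dest is not None:
                ready[dest] = cycle + latency + 1
    return trace


# The four repeating instructions of the kernel (mnemonics from Listing 1).
KERNEL = [
    ("ld", ("a0",), "t0", 2),          # scalar load, two EX cycles
    ("add", ("a0", "a2"), "a0", 1),    # bump the A pointer
    ("vins", ("t0",), None, 2),        # needs t0 from the load
    ("vmadd", (), None, 2),            # no scalar operands
]

if __name__ == "__main__":
    for name, cycle in issue_cycles(KERNEL, 3):
        print(name, cycle)
```

In this model the load-to-broadcast dependence inserts one bubble per iteration, so consecutive vector multiply-adds are issued five cycles apart.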
We used loop unrolling and software pipelining to code the
algorithm of Figure 5 in our C implementation. The use of
these techniques to improve performance is visible in Listing 1.
We unrolled the for loop of line 8 in Figure 5, which
corresponds to lines 9-12 of Listing 1, repeated b times on the
following lines. This avoids any branching at the end of
the loop. Moreover, two vectors hold rows of matrix B. This
double buffering allows for the simultaneous loading of one
row in vector vB1, in line 7, while vB0 is used for the FMAs,
as in line 12 in Listing 1. After line 21, vB1 is used for the
computation, while another row of B is loaded into vB0.
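The unrolled, double-buffered structure can be made explicit with a small generator that emits the instruction pattern of Listing 1 for an arbitrary block size b. This generator is purely illustrative and is not part of the authors' tool flow; the mnemonics and register names are copied from Listing 1, while the function name is hypothetical.

```python
def phase2_kernel(b):
    """Yield the unrolled, double-buffered phase-II instruction sequence."""

    def fmas(vb):
        # One multiply-add per row of the block, fully unrolled (no branches).
        for j in range(b):
            yield "ld    t0, 0(a0)"
            yield "add   a0, a0, a2"
            yield "vins  vA, t0, zero"
            yield f"vmadd vC{j}, vA, {vb}, vC{j}"

    yield "vld   vB0, 0(a1)"   # prologue: load the first row of B
    yield "add   a1, a1, a3"
    for current, following in (("vB0", "vB1"), ("vB1", "vB0")):
        yield f"vld   {following}, 0(a1)"   # start loading the next row of B ...
        yield "add   a1, a1, a3"
        yield from fmas(current)            # ... while computing with the current one


if __name__ == "__main__":
    print("\n".join(phase2_kernel(4)))
```

For b = 4 the output has the same shape as Listing 1: each vector load of one B buffer is followed by four multiply-adds that read the other buffer.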
The three phases of the computation can be distinguished
clearly in Figure 7, which shows the utilization of the VLSU
and FPU for a 32 × 32 matrix multiplication on a four-lane Ara
instance. Note how the FPUs are almost fully utilized during
phase II, while being almost idle otherwise.
[Figure 7: three stacked panels (LD, FPU, ST) showing utilization [%], from 0 to 100, over time [×10³ cycles], from 0 to 10.]
Fig. 7. Utilization of Ara’s functional units for a 32 × 32 matrix multiplication
on an Ara instance with four lanes.
V. PERFORMANCE ANALYSIS
In this section, we analyze Ara in terms of its peak
performance across several design parameters. We use the
matrix multiplication kernel to explore architectural limitations
in depth, before analyzing how such limitations manifest
themselves for the other kernels.
A. Matrix multiplication
Figure 8 shows the performance measurements of the matrix
multiplication C ← AB + C, for several Ara instances and
problem sizes n × n.
[Figure 8: roofline plot of performance [DP−FLOP/cycle], from 2 to 32, against arithmetic intensity [DP−FLOP/B], from 0.25 to 32, for n = 16, 32, 64, 128, 256 and ℓ = 2, 4, 8, 16, with the issue-rate boundary marked.]
Fig. 8. Performance results for the matrix multiplication C ← AB + C, with
different numbers of lanes ℓ, for several n × n problem sizes. The bold red
line depicts a performance boundary due to the instruction issue rate. The
numbers between brackets indicate the performance loss, with respect to the
theoretically achievable peak performance.
As we can see, for problems large enough, the performance
measurements meet the peak performance boundary. For a
matrix multiplication of size 256 × 256, we utilize the FPUs for
98% of the time for an Ara instance with two lanes and for 97%
for 16 lanes. The performance scalability comes, however, at a
price. More lanes require larger problem sizes to fully exploit
the maximum performance, even though all problem sizes fall
into the compute-bound regime. Smaller problems, however,
cannot fully utilize the functional units. It is important to note
that this limiting effect can also be observed in other vector
processors such as Hwacha (see comparison in Section V-D).
This effect is attributed to two main reasons: first, the initial-
ization of the vector register file before starting computation;
and second, the rate at which the vector instructions are issued
to Ara. The former is represented by phases I and III of
the computation as analyzed in Section IV-B. The latter is
related to the rate at which the vector FMA instructions are
issued. To understand this phenomenon, consider that smaller
vectors occupy the pipeline for fewer cycles, and more vector
instructions are required to fully utilize the FPUs. If every
vector FMA instruction occupies the FPUs for τ cycles and
they are issued every δ cycles, the system performance ϖ is
limited by
$$\varpi \leq \Pi \frac{\tau}{\delta}. \tag{2}$$
For the n × n matrix multiplication, τ is equal to 2n/Π, with
this ratio being possibly lower than one if the matrix size is
smaller than the number of parallel lanes. We use Equation (1)
to rewrite this constraint in terms of the arithmetic intensity
I_MATMUL, resulting in
$$\varpi \leq \frac{32}{\delta} I_{\mathrm{MATMUL}}. \tag{3}$$
This translates to another performance boundary in the roofline
plot, purely dependent on the instruction issue rate. As analyzed
in Section IV-B, the FMA instructions are issued every five
cycles. This shifts the roofline of the architecture as illustrated
with the bold line in Figure 8. Note that, for 16 lanes, even a
64 × 64 matrix multiplication is limited by the issue rate.
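The issue-rate boundary of Equations (2) and (3) is easy to evaluate numerically. The sketch below assumes, as before, a peak performance of Π = 2ℓ DP−FLOP/cycle, and uses the measured δ = 5 cycles as the default; the function name is hypothetical.

```python
def issue_bound(n, lanes, delta=5):
    """Upper bound on MATMUL performance [DP-FLOP/cycle] due to the issue rate.

    Assumes Pi = 2 * lanes DP-FLOP/cycle. tau = 2n / Pi is the number of
    cycles one n-element vector FMA occupies the FPUs, and delta is the
    number of cycles between two consecutive vector FMA issues.
    """
    peak = 2.0 * lanes
    tau = 2.0 * n / peak
    return min(peak, peak * tau / delta)


if __name__ == "__main__":
    for n in (16, 32, 64, 128, 256):
        print(n, issue_bound(n, lanes=16))
```

With sixteen lanes, the bound for n = 64 is 2n/δ = 25.6 DP−FLOP/cycle, below the 32 DP−FLOP/cycle peak, whereas for n = 128 the issue rate no longer limits the kernel.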
The performance degradation with shorter vectors could
be mitigated in several ways. One option would be a more
complex instruction issue mechanism, either going superscalar
or introducing a VLIW-capable ISA to increase the issue rate.
Shorter vectors bring vector processors to a regime closer to
the regime of an array processor, where the vector instructions
execute for a single cycle. This puts pressure on the issue logic,
demanding more than a simple single-issue in-order core. For
example, all ARM Cortex-A cores with Neon capability are
also superscalar [32]. Another alternative would be the use
of an MIMD approach where the lanes would be decoupled,
running instructions issued by different scalar cores. This
solution, however, increases instruction traffic and duplicates
the instruction issue logic, which degrades the energy efficiency.
B. AXPY
As discussed in Section IV-A, DAXPY is a heavily memory-
bound kernel, with an arithmetic intensity of 0.083 DP−FLOP/B.
It is no surprise that the measured performance for such a
kernel is much lower than the system’s peak performance
in the compute-bound region. For an Ara instance with
two lanes, we measure 0.65 DP−FLOP/cycle, which is 98%
of the theoretical performance limit. For sixteen lanes, the
achieved 4.27 DP−FLOP/cycle is still 80% of the theoretical
limit βIDAXPY from the roofline plot. The limiting factor is the
configuration of the vector unit, whose overhead increases the
runtime from the ideal 96 cycles to 120 cycles.
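The DAXPY figures above can be cross-checked against the memory-bound limit βI. The sketch assumes β = 4ℓ B/cycle, i.e., the 32ℓ bit memory width of Table II; the function names are hypothetical and the measured values are those quoted in the text.

```python
DAXPY_INTENSITY = 1.0 / 12.0   # DP-FLOP/B, from Section IV-A


def daxpy_limit(lanes):
    """Memory-bound performance limit beta * I in DP-FLOP/cycle."""
    bandwidth = 4.0 * lanes    # B/cycle, assuming a 32 * lanes bit memory port
    return bandwidth * DAXPY_INTENSITY


def daxpy_ideal_cycles(n, lanes):
    """Cycles needed just to move the 24n bytes of DAXPY traffic."""
    return 24.0 * n / (4.0 * lanes)


if __name__ == "__main__":
    print(0.65 / daxpy_limit(2), 4.27 / daxpy_limit(16))
    print(daxpy_ideal_cycles(256, 16), daxpy_ideal_cycles(256, 16) / 120.0)
```

For vectors of length 256 on sixteen lanes, the ideal transfer time is 96 cycles, and the ratio 96/120 matches the 80% of the theoretical limit reported above.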
C. Convolution
Convolutions are heavily compute-bound kernels, with an
arithmetic intensity of up to 34.9 DP−FLOP/B. With two lanes,
DCONV achieves a performance of up to 3.73 DP−FLOP/cycle. We
notice some performance degradation for sixteen lanes, where the
kernel achieves 26.7 DP−FLOP/cycle, i.e., an FPU utilization
of 83.2%, close to the performance achieved by the 128 × 128
matrix multiplication. The reason for the performance drop in
both kernels lies in the problem size. In this case, each lane
holds only seven elements of the 112-element-long vectors, i.e.,
the vectors do not even occupy the eight banks. With such short
instructions, the system does not have enough time to reach the
steady-state banking access pattern discussed in Section III-E2.
Such short instructions incur banking conflicts that would
otherwise be amortized across longer vectors.
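The number of vector elements each lane receives can be computed as follows. The sketch assumes that elements are distributed evenly across the lanes, which is consistent with the seven elements per lane quoted above; the function name is hypothetical, and the bank count is the one listed in Table II.

```python
BANKS_PER_LANE = 8   # from Table II


def elements_per_lane(vector_length, lanes):
    """Largest number of elements of one vector held by a single lane."""
    return -(-vector_length // lanes)   # ceiling division


if __name__ == "__main__":
    for lanes in (2, 4, 8, 16):
        count = elements_per_lane(112, lanes)
        print(lanes, count, count >= BANKS_PER_LANE)
```

For the 112-element vectors of DCONV, only the sixteen-lane instance ends up with fewer elements per lane than there are banks.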
Figure 9 shows the performance results for the three
considered benchmarks. In both memory- and compute-bound
regions, the achieved performance tends to meet the roofline
boundary, for all the considered architecture instances.
[Figure 9: roofline plot of performance [DP−FLOP/cycle], from 0.5 to 32, against arithmetic intensity [DP−FLOP/B] for DAXPY, MATMUL, and DCONV, with ℓ = 2, 4, 8, 16.]
Fig. 9. Performance results for the three considered benchmarks, with different
numbers of lanes ℓ. AXPY uses vectors of length
256, the MATMUL is between matrices of size 256 × 256, and CONV uses
GoogLeNet’s sizes. The numbers between brackets indicate the performance
loss, with respect to the theoretically achievable peak performance.
D. Performance comparison with Hwacha
For comparison with Ara, we measured Hwacha’s perfor-
mance for the matrix multiplication benchmark, using the
publicly available Hardware Description Language (HDL)
sources and tooling scripts from their GitHub repository2.
We were not able to reproduce the 32 × 32 double precision
matrix multiplication performance claimed by Dabbelt et al. [6].
The reason for this is that Hwacha relies on a closed-source
L2 cache, whereas its public version has a limited memory
system with no banked cache and with a broadcast hub to
ensure coherence. The lack of a banked cache effectively limits
Hwacha’s memory bandwidth to 128 bit/cycle, starving the
FMA units and severely limiting the achievable performance.
Table I places the performance achieved by Ara and the
published results for Hwacha [6] side by side. For a fair
comparison, the roofline performance boundaries are identical
between the compared architectures. For small problems, for
which a direct comparison is possible, Ara utilizes its FPUs
much better than the equivalent Hwacha instances. For the
instances with two lanes, Ara utilizes its FPUs 66% more than
the equivalent Hwacha instance, for a relatively small 32 × 32
matrix multiplication. Moreover, we note that both Ara and
Hwacha operate at a similar architectural design point in the
sense that they are coupled to a single-issue in-order core.
Therefore, Hwacha exhibits a similar performance degradation
on small matrices and vector lengths as previously described
for Ara in Section V-A. As for large problems,
another more recent reference on Hwacha [26] claims a 95%
FPU utilization for a 128 × 128 MATMUL, which is close to
the performance level that Ara achieves. However, these results
cannot be reproduced on the current open-source version of
Hwacha, possibly because of the memory system limitation
outlined above.
2 See https://github.com/ucb-bar/hwacha-template/tree/a5ed14a.
TABLE I
NORMALIZED ACHIEVED PERFORMANCE BETWEEN EQUIVALENT ARA
AND HWACHA INSTANCES FOR A MATRIX MULTIPLICATION, WITH
DIFFERENT n × n PROBLEM SIZES.

Π        8 DP−FLOP/cycle      16 DP−FLOP/cycle     32 DP−FLOP/cycle
n        Ara      Hwacha¹     Ara      Hwacha      Ara      Hwacha
16       49.5%    —           25.4%    —           12.8%    —
32       82.6%    49.9%       53.4%    35.6%       27.6%    22.4%
64       89.6%    —           77.5%    —           45.6%    —
128      94.3%    —           93.1%    —           78.8%    —
¹ Performance results extracted from [6].
VI. IMPLEMENTATION RESULTS
In this section, we analyze the implementation of several
Ara instances, in terms of area, power and energy efficiency.
A. Methodology
Ara was synthesized for GLOBALFOUNDRIES’ 22FDX FD-
SOI technology using Synopsys Design Compiler 2017.09. The
back-end design flow was carried out with Cadence Innovus
18.11.000. For this technology, one gate equivalent (GE) is equal
to 0.199 µm². Ara’s performance and power figures of merit
are measured running the kernels on a cycle-accurate Register
Transfer Level (RTL) simulation, back-annotated with timing
information from the synthesized design at TT/0.80 V/25 ◦C.
We used Synopsys PrimeTime 2016.12 to extract the power
figures. Table II summarizes Ara’s design parameters.
TABLE II
DESIGN PARAMETERS.

# Lanes                ℓ ∈ [2, 4, 8, 16]
Memory width           32ℓ bit
Operating corner       TT/0.80 V/25 ◦C
Target frequency       1 GHz
VRF   Size             16 KiB/lane
      # Banks          8 banks/lane
      Bank width       64 bit
B. Synthesis results
Table III summarizes the synthesis results of several Ara
instances at the typical corner (TT/0.80 V/25 ◦C). Overall, the
instances achieve operating frequencies around 1.2 GHz.
At the worst-case corner, SS/0.72 V/125 ◦C, the system can be
clocked at a frequency up to about 0.95 GHz. Because the
maximum frequencies achieved after synthesis are usually
higher than the ones achieved after the back-end flow, the
system was synthesized for a clock period constraint 250 ps
shorter than the target clock period of 1 ns. The frequency
results of Table III take this into account, and the frequencies
listed indicate what we expect after a back-end run. The system
can be tuned for even higher frequencies by deploying Forward
Body-Biasing (FBB) techniques, at the expense of an increase
in leakage power. On average, the final designs have a mix of
72.9% Low Voltage Threshold (LVT) cells and 27.1% Super
Low Voltage Threshold (SLVT) cells.

TABLE III
POST-SYNTHESIS ARCHITECTURAL COMPARISON BETWEEN SEVERAL ARA INSTANCES IN GLOBALFOUNDRIES 22FDX TECHNOLOGY IN TERMS OF
PERFORMANCE, POWER CONSUMPTION, AND ENERGY EFFICIENCY IN THE TYPICAL CORNER (TT/0.80 V/25 ◦C).

Figure of merit             ℓ = 2                   ℓ = 4                   ℓ = 8                   ℓ = 16
Clock [GHz]                 1.3                     1.3                     1.2                     1.1
Area [kGE]                  2082                    3188                    5436                    10 001
Logic/Macros [kGE]          845/694                 1356/1051               2436/1764               4651/3191
Area per lane [kGE]         1081                    797                     679                     625
Kernel                      matmul¹ dconv²  daxpy³  matmul  dconv   daxpy   matmul  dconv   daxpy   matmul  dconv   daxpy
Performance [DP−GFLOPS]     5.11    4.85    0.85    10.2    9.59    1.62    18.7    17.4    2.88    34.3    29.3    4.70
Core power [mW]             128     128     65.9    220     180     109     319     260     171     510     448     258
Leakage [mW]                6.5                     9.4                     15                      27
Ariane/Ara [mW]             32/96   33/95   23/43   35/185  34/146  25/84   28/290  27/233  21/150  33/477  31/418  22/236
Core power per lane [mW]    64      64      33.0    54.9    45.1    27.2    39.9    32.4    21.4    31.9    28      16.2
Efficiency [DP−GFLOPS/W]    39.9    30.8    12.4    46.4    53.3    14.9    58.6    66.9    16.8    67.3    65.4    18.2
¹ Double precision floating point 256 × 256 matrix multiplication.
² Double precision floating point tensor convolution with sizes from the first layer of GoogLeNet. Input size is 3 × 112 × 112 and kernel size is 64 × 3 × 7 × 7.
³ Double precision AXPY of vectors with length 256.
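The efficiency row of Table III is the ratio of the performance row to the core power row. As a quick consistency check, the sketch below recomputes it for the MATMUL columns; the dictionary simply transcribes those table entries, and the names are hypothetical.

```python
# lanes -> (performance [DP-GFLOPS], core power [mW]) for MATMUL, from Table III
MATMUL_TABLE_III = {
    2: (5.11, 128.0),
    4: (10.2, 220.0),
    8: (18.7, 319.0),
    16: (34.3, 510.0),
}


def energy_efficiency(gflops, power_mw):
    """Energy efficiency in DP-GFLOPS/W."""
    return gflops / (power_mw / 1000.0)


if __name__ == "__main__":
    for lanes, (gflops, power_mw) in MATMUL_TABLE_III.items():
        print(lanes, round(energy_efficiency(gflops, power_mw), 1))
```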
The two-lane instance has its critical path inside the double
precision FMA. This block relies on the automatic retiming
feature from Synopsys Design Compiler, and the register
placement could be further improved by hand-tuning, or by
increasing the number of pipeline stages. Another critical path
is on the combinational handshake between the VLSU and
its operand queues in the lanes. Both paths are about 40 gate
delays long. Timing of the instances with eight and sixteen
lanes becomes increasingly critical, due to the widening of
Ara’s memory interface. This happens when the VLSU collects
64 bit words from all the lanes, realigns and packs them into a
wide word to be sent to memory. The instance with 16 lanes
incurs a 15% clock frequency penalty when compared
with the frequency achieved by the instance with two lanes.
The silicon area and leakage power of the accompanying
scalar core are amortized among the lanes, which can be seen
in the decreasing area-per-lane figure of merit. Figure 10
shows the area breakdown of an Ara instance with four lanes.
Ara’s total area (excluding the scalar core) is 2.29 MGE, out of
which each lane amounts to 533 kGE. The area of the vector
unit is dominated by the lanes, while the other blocks amount
to only 7% of the total area. The area of the lanes is dominated
by the VRF (35%), the FPU (27%), and the multiplier (18%).
[Figure 10: area breakdown of a) the vector unit, with lanes 0 to 3, VLSU, SLDU, sequencer, and front end; and b) one lane, with VRF, FPU, MUL, ALU, operand queues, and lane sequencer.]
Fig. 10. Area breakdowns of a) an Ara instance with four lanes with detail
on b) one of its lanes. Ara’s total area, excluding the scalar processor, is
2.29 MGE. Each lane has about 533 kGE.
In terms of logic area, a Hwacha instance with four lanes uses
0.354 mm² [6], or 1098 kGE, which is 19% smaller than the
equivalent Ara instance³. The trend is also valid for equivalent
instances with eight and sixteen lanes. The main reason for
this area difference is that Hwacha has only half as many
integer multipliers as Ara, i.e., Hwacha has one MUL per two
FMA units [33]. These multipliers account for 9% of the
area difference. Moreover, unlike Ara, these specific Hwacha
instances do not support mixed-precision arithmetic [6], and
its support would incur a 4% area overhead [34].
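The gate-equivalent conversion behind the Hwacha area comparison can be reproduced with the ideal-scaling assumption stated in the accompanying footnote. In the sketch below, the 1356 kGE reference is the logic area of the four-lane Ara instance from Table III, which we assume is the figure being compared against; the names are hypothetical.

```python
GE_22NM_UM2 = 0.199   # one gate equivalent in 22FDX [um^2]


def scaled_gate_equivalent(ge_um2, from_nm, to_nm):
    """Ideal (quadratic) scaling of the gate-equivalent area between nodes."""
    return ge_um2 * (to_nm / from_nm) ** 2


def area_in_kge(area_mm2, ge_um2):
    """Convert an area in mm^2 into kGE for a given gate-equivalent size."""
    return area_mm2 * 1e6 / ge_um2 / 1e3


if __name__ == "__main__":
    ge_28 = scaled_gate_equivalent(GE_22NM_UM2, from_nm=22, to_nm=28)
    hwacha_kge = area_in_kge(0.354, ge_28)
    ara_logic_kge = 1356.0   # four-lane Ara, logic area (assumed reference)
    print(ge_28, hwacha_kge, 1.0 - hwacha_kge / ara_logic_kge)
```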
We used the synthesized designs to analyze the performance
and energy efficiency of Ara when running the considered
benchmarks. Due to the asymmetry between the code that
runs in Ariane and in Ara, we decided to extract switching
activities by running the benchmarks on the netlists
back-annotated with post-synthesis timing information. As expected,
the energy efficiency of Ara coupled to an Ariane core is
considerably higher than that of an Ariane core alone. For
instance, a 256 × 256 integer matrix multiplication achieves up
to 53.4 GOPS/W energy efficiency on an Ara with four lanes,
whereas a comparable benchmark runs with 17 GOPS/W on
Ariane [27]. In that case, the instruction and data caches alone
are responsible for 46% of Ariane’s power dissipation. In Ara’s
case, most of the memory accesses go directly into the VRF
and energy spent for cache accesses can be amortized over
many vector lanes and cycles, increasing the system’s energy
efficiency with the number of lanes. Overall, by amortizing the
power dissipation across an increasing number of lanes, while
maintaining a good FPU utilization and with a moderate speed
degradation, Ara’s energy efficiency increases by 68% when
comparing the architectural instance with two lanes with the
one with sixteen lanes.
3 As Dabbelt et al. [6] do not specify the technology they used, we considered
an ideal scaling from 28 nm to 22 nm. Therefore, we considered one gate-
equivalent in 28 nm to be (28/22)² times bigger than one gate-equivalent in 22 nm,
or 0.322 µm².
A Hwacha implementation in ST 28 nm FD-SOI technology
(with an undisclosed number of lanes) achieves a peak energy
efficiency of 40 DP−GFLOPS/W [26]. Adjusted for technology
scaling gains [1], this corresponds to 43 DP−GFLOPS/W, which is
lower than the energy efficiency of all comparable Ara instances
running MATMUL. Under the hypothesis, based on the design
floorplan, that this Hwacha implementation has eight lanes, Ara
achieves an energy efficiency 36% higher than Hwacha.
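The relative improvements quoted in this section can be recomputed from the MATMUL efficiency entries of Table III and the scaled Hwacha figure. The helper below is a hypothetical one-liner, and its inputs are the rounded values from the text, so the results agree with the quoted 68% and 36% only to within about one percentage point.

```python
def relative_gain(value, reference):
    """Fractional improvement of value over reference."""
    return value / reference - 1.0


if __name__ == "__main__":
    # Sixteen lanes versus two lanes, MATMUL efficiency [DP-GFLOPS/W].
    print(relative_gain(67.3, 39.9))
    # Eight-lane Ara versus the technology-scaled Hwacha figure.
    print(relative_gain(58.6, 43.0))
```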
C. Physical implementation
An Ara instance with four lanes was placed and routed as a
0.75 mm×1.30 mm macro in GLOBALFOUNDRIES 22FDX FD-
SOI technology, using Cadence Innovus 18.11.000. Figure 11
shows the final implemented result, highlighting its internal
blocks. Without its caches, Ariane uses about the same area
(474 kGE) as one lane, including its VRF.
Our vector processor is scalable, in the sense that Ariane can
be reused without changes to drive a wide range of different
lane parameters. Furthermore, each identical vector lane touches
only its own section of the VRF, hence not introducing any
scalability bottlenecks. Its scalability is evidenced by the
energy efficiency and power-per-lane figures of merit, which
improve with an increasing number of lanes for all considered
benchmarks. Scalability is only limited by the units that need
to interface with all lanes at once, namely the main sequencer,
the VLSU, and the SLDU.
Unlike our architecture, Hwacha has several VLSUs, each one
128 bit wide and serving four lanes. This solves the scalability
issue locally, by controlling the growth of the memory interface.
Such a solution, however, pushes the memory interconnect issue
further upstream, as its wide memory system must be able to
aggregate multiple parallel requests from all these VLSUs to
achieve their maximum memory throughput.
VII. CONCLUSIONS
In this work, we presented Ara, a parametric in-order high-
performance energy-efficient 64-bit vector unit based on the
version 0.5 draft of RISC-V’s vector extension. Ara acts as a
coprocessor tightly coupled to Ariane, an open-source Linux-
capable application-class RV64GC core.
Ara’s microarchitecture was designed with scalability in mind.
To this end, it is composed of a set of identical lanes, each
hosting part of the system’s vector register file and functional
units. The lanes communicate with each other via the VLSU
and the SLDU, responsible for executing instructions that touch
all the VRF banks at once. These units arguably represent the
weak points when it comes to scalability, because they get wider
with an increasing number of lanes. Hwacha takes an alternative
approach, having several narrow memory ports instead of a
single wide one. This approach, however, does not solve the
scalability problem, but just deflects it further to the memory
interconnect and cache subsystem.
[Figure 11: two annotated layout views, (a) and (b), described below.]
(a) Place-and-route results of an Ara instance with four lanes, highlighting
its internal blocks: A) lane 0; B) lane 1; C) lane 2; D) lane 3; E) SLDU; F)
sequencer; G) VLSU; H) Ara front end; I) Ariane; J) memory interconnect.
(b) Detail of one of Ara’s lanes, highlighting its internal blocks: A) lane
sequencer; B) VRF; C) operand queues; D) MUL; E) FPU; F) ALU.
Fig. 11. Place-and-route results of an Ara instance with four lanes in
GLOBALFOUNDRIES 22 nm technology on a 0.75 mm × 1.30 mm macro.
We presented results for Ara configurations with two up to
sixteen lanes in GLOBALFOUNDRIES 22FDX technology, and
showed that we achieve clock frequencies around 1.2 GHz
in the typical corner. Our post-synthesis results indicate
that, in terms of energy efficiency, our design is 2.7× more
energy efficient than Ariane alone, running an equivalent
benchmark. An eight-lane instance of our design achieves up
to about 59 DP−GFLOPS/W running computationally intensive
benchmarks, which is 36% more efficient than the equivalent
Hwacha implementation, at 43 DP−GFLOPS/W.
We measured the performance of Ara using matrix multiplica-
tion, convolution (both compute-bound), and AXPY (memory-
bound) double-precision kernels. For problems “large enough,”
the compute-bound kernels almost saturate the FPUs, with the
measured performance of a 256 × 256 matrix multiplication
only 3% below the theoretically achievable peak performance.
We also observe a performance degradation for problems
whose size is comparable to the number of vector lanes. This is
not a limitation of Ara per se, but rather of vector processors in
general, when coupled to a single-issue in-order core. The main
reason for the low FPU utilization for small problems is the rate
at which the scalar core issues vector instructions. The shorter
the vector length is, the more vector instructions are required to
fill the pipeline. With our MATMUL implementation, Ariane
issues a vector FMA instruction every five cycles.
To this end, we believe that it would be interesting to
investigate whether and to what extent this performance limit
could be mitigated by leveraging a superscalar or VLIW-capable
core to drive the vector coprocessor. Another approach could
be the use of multiple small and simple cores that drive the
individual vector lanes. In any case, care must be taken in
order not to degrade the energy efficiency of the design.
ACKNOWLEDGEMENTS
We would like to thank Francesco Conti and Frank
Gürkaynak for the discussions and insights.
REFERENCES
[1] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge,
“Near-threshold computing: Reclaiming Moore’s law through energy
efficient integrated circuits,” Proceedings of the IEEE, vol. 98, no. 2, pp.
253–266, Feb. 2010.
[2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro,
vol. 32, no. 3, pp. 122–134, May 2012.
[3] I. Hwang and M. Pedram, “A comparative study of the effectiveness of
CPU consolidation versus dynamic voltage and frequency scaling in a
virtualized multicore server,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 24, no. 6, pp. 2103–2116, Jun. 2016.
[4] S. Kiamehr, M. Ebrahimi, M. S. Golanbari, and M. B. Tahoori,
“Temperature-aware dynamic voltage scaling to improve energy efficiency
of near-threshold computing,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 25, no. 7, pp. 2017–2026, Jul. 2017.
[5] J. Backus, “Can programming be liberated from the von Neumann style?:
A functional style and its algebra of programs,” Commun. ACM, vol. 21,
no. 8, pp. 613–641, Aug. 1978.
[6] D. Dabbelt, C. Schmidt, E. Love, H. Mao, S. Karandikar, and K. Asanović,
“Vector processors for energy-efficient embedded systems,” in Proceedings
of the Third ACM International Workshop on Many-core Embedded
Systems, ser. MES ’16. New York, NY, USA: ACM, 2016, pp. 10–16.
[7] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep
neural networks: A tutorial and survey,” Proceedings of the IEEE, vol.
105, no. 12, pp. 2295–2329, Dec. 2017.
[8] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C.
Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96, no. 5, pp.
879–899, May 2008.
[9] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla:
A unified graphics and computing architecture,” IEEE Micro, vol. 28,
no. 2, pp. 39–55, Mar. 2008.
[10] Green500, “Green500 list - November 2018,” Nov. 2018. [Online].
Available: https://www.top500.org/green500/lists/2018/11/
[11] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal,
L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao,
and K. Zieba, “End to end learning for self-driving cars,” CoRR, 2016.
[Online]. Available: http://arxiv.org/abs/1604.07316
[12] R. M. Russell, “The CRAY-1 computer system,” Commun. ACM, vol. 21,
no. 1, pp. 63–72, Jan. 1978.
[13] NVIDIA Tesla V100 GPU Architecture, NVIDIA, Aug. 2017,
v1.1. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[14] M. M. Mano, C. R. Kime, and T. Martin, Logic and Computer Design
Fundamentals, 5th ed. Hoboken, NJ, USA: Pearson High Education,
2015.
[15] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative
Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2011.
[16] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli,
M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico,
and P. Walker, “The ARM Scalable Vector Extension,” IEEE Micro,
vol. 37, no. 2, pp. 26–39, Mar. 2017.
[17] “Working draft of the proposed RISC-V V vector extension,” 2019,
accessed: 2019-03-01. [Online]. Available: https://github.com/riscv/riscv-v-spec
[18] M. J. Flynn, “Some computer organizations and their effectiveness,” IEEE
Transactions on Computers, vol. C-21, no. 9, pp. 948–960, Sep. 1972.
[19] A. Peleg and U. Weiser, “MMX technology extension to the Intel
architecture,” IEEE Micro, vol. 16, no. 4, pp. 42–50, Aug. 1996.
[20] J. Reinders, “Intel AVX-512 instructions,” Intel Software Developer
Zone, Jun. 2017. [Online]. Available: https://software.intel.com/en-us/blogs/2013/avx-512-instructions
[21] ARM, “Neon,” Accessed: 1st May 2019. [Online]. Available:
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
[22] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi,
E. Flamand, F. K. Gürkaynak, and L. Benini, “Near-threshold RISC-V
core with DSP extensions for scalable IoT endpoint devices,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25,
no. 10, pp. 2700–2713, Oct. 2017.
[23] K. Asanović, “Vector microprocessors,” Ph.D. dissertation, University of
California, Berkeley, 1998.
[24] T. Yoshida, “Fujitsu high performance CPU for the Post-K computer,”
in Hot Chips: A Symposium on High Performance Chips, ser. HC30,
Cupertino, CA, USA, Aug. 2018.
[25] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and
K. Asanović, “Exploring the tradeoffs between programmability and
efficiency in data-parallel accelerators,” SIGARCH Comput. Archit. News,
vol. 39, no. 3, pp. 129–140, 2011.
[26] C. Schmidt, A. Ou, and K. Asanović, “Hwacha: A data-parallel RISC-V
extension and implementation,” in Inaugural RISC-V Summit Proceedings.
Santa Clara, CA, USA: RISC-V Foundation, Dec. 2018. [Online]. Available:
https://content.riscv.org/wp-content/uploads/2018/12/Hwacha-A-Data-Parallel-RISC-V-Extension-and-Implementation-Schmidt-Ou-.pdf
[27] F. Zaruba and L. Benini, “The cost of application-class processing: Energy
and performance analysis of a Linux-ready 1.7GHz 64bit RISC-V core
in 22nm FDSOI technology,” arXiv e-prints, Apr. 2019.
[28] S. Mach, D. Rossi, G. Tagliavini, A. Marongiu, and L. Benini, “A
transprecision floating-point architecture for energy-efficient embedded
computing,” in 2018 IEEE International Symposium on Circuits and
Systems (ISCAS), May 2018, pp. 1–5.
[29] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful
visual performance model for multicore architectures,” Commun. ACM,
vol. 52, no. 4, pp. 65–76, Apr. 2009.
[30] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. Spampinato, and
M. Pueschel, “Applying the roofline model,” in IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS),
Mar. 2014, pp. 76–85.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Computer Vision and Pattern Recognition (CVPR), 2015. [Online].
Available: http://arxiv.org/abs/1409.4842
[32] ARM, “Arm Cortex-A series processors,” Accessed: 29th
April 2019. [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-a
[33] Y. Lee, A. Ou, C. Schmidt, S. Karandikar, H. Mao, and K. Asanović, “The
Hwacha microarchitecture manual,” University of California at Berkeley,
Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-263, Dec. 2015.
[34] Y. Lee, C. Schmidt, S. Karandikar, D. Dabbelt, A. Ou, and K. Asanović,
“Hwacha preliminary evaluation results,” University of California at
Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-264, Dec.
2015.
