OTTER Vector Extension by Peralta, Alexis A
OTTER VECTOR EXTENSION
Senior Project Report
California Polytechnic State University,
San Luis Obispo
In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Computer Engineering
by
Alexis Peralta
June 2020
© 2020
Alexis Peralta
ALL RIGHTS RESERVED
TITLE: Otter Vector Extension
AUTHOR: Alexis Peralta
DATE SUBMITTED: June 2020
ADVISOR: Joseph Callenes-Sloan, Ph.D.
Professor of Electrical and Computer Engineering
ABSTRACT
Otter Vector Extension
Alexis Peralta
This paper presents an implementation of a subset of the "RISC-V 'V' Vector Extension",
v0.7.x. The "RISC-V 'V' Vector Extension" is the proposed vector instruction set for the
open-source RISC-V architecture. Vector operations are inherently data-parallel, allowing
for significant performance increases. Vectors have applications in fields such as
cryptography, graphics, and machine learning. A vector processing unit was added to
Cal Poly's RISC-V multi-cycle architecture, known as the OTTER. Computationally
intensive programs running on the OTTER Vector Extension ran over three times
faster than on the baseline multi-cycle implementation. Memory-intensive
applications saw similar performance increases.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 Vectors in Computer Architecture
1.1 Related Work
2 OTTER RISC-V Vector Extension Overview
2.1 Vector Terminology
3 OTTER Vector Extension Implementation
3.1 Vector Configuration
3.2 Vector Register File
3.2.1 Vector Masking
3.3 Vector Loads and Stores
3.3.1 Vector Memory Assist
3.4 Vector Algorithmic Logic Unit
4 Results & Analysis
5 Future Work
BIBLIOGRAPHY
APPENDICES
A
B
LIST OF TABLES
2.1 SEW vtype encoding
4.1 Array Addition Speedup Results
4.2 memcpy Speedup Results
A.1 OTTER Vector Extension Instructions
LIST OF FIGURES
3.1 Vector length and SEW CSRs
3.2 Vector layout in VRF
3.3 VRF
3.4 Example of a vector mask stored at v0
3.5 Vector Memory Assist Unit
B.1 OTTER with Vector Extension
Chapter 1
VECTORS IN COMPUTER ARCHITECTURE
Vectors provide inherent data-level parallelism and appear frequently in high-performance
computing [10]. They support algorithms used in many naturally vectorizable applications,
such as multimedia extensions and processing, human-machine interfacing, sound and image
processing, wireless communication, bioinformatics, climate modeling, cryptography, game
consoles, and graphics processing [8, 5, 2, 11, 7]. For many of these compute-heavy
applications, modern vector architectures even outperform superscalar architectures [8].
Vector processors are also often less complex to construct and to scale with CMOS
technology than superscalar computers [8].
A vector processor is a central processing unit (CPU) capable of performing a single
operation across multiple pieces of data simultaneously [3]. Cray Research built one of
the first vector processors, the CRAY-1, in the 1970s [3]. Today, most compilers vectorize
code where possible to improve performance [10]. Vector instruction sets are well suited
to expressing modern applications in data-parallel form [11]. In particular, they capture
data parallelism in the form of single instruction, multiple data (SIMD). A single vector
instruction encodes N scalar instructions and starts multiple data operations [11, 6].
Vectors therefore eliminate the bandwidth that would otherwise be spent fetching many
instructions in order to perform many operations [11]. Multiple instruction, multiple data
(MIMD) processors can be less efficient than vector architectures because of the fetching
overhead before every operation [6]. Data parallelism on vector architectures can provide
performance as fast as a single cycle per instruction, even for instructions that perform
tens of arithmetic computations [8]. Vector architectures are easily programmable and
highly flexible while also delivering power and performance metrics that rival custom
designs [8]. Vector instruction sets allow the programmer to maintain sequential ordering
while achieving performance increases from data-parallel operations [6]. Because vector
architectures are programmable, fabricated versions can be repurposed for different
applications [11].
RISC-V International is in the process of developing a "RISC-V 'V' Vector Extension"
to bring SIMD instructions to the RISC-V instruction set [9]. Until the appearance of
RISC-V, the most widely used architectures, primarily chips from Intel and ARM,
remained proprietary [4]. The introduction of the free, open-source RISC-V instruction
set architecture brought unprecedented flexibility to the computing industry [4].
This paper presents an implementation of a subset of the v0.7.x proposed "RISC-V
'V' Vector Extension" instructions. All implementation was completed using
SystemVerilog and Vivado Design Suite for the BASYS3 Artix-7 FPGA.
1.1 Related Work
A vector architecture similar to the proposed "RISC-V 'V' Vector Extension" is
"VMIPS", a vector extension of the MIPS (Microprocessor without Interlocked Pipelined
Stages) architecture [6]. One characteristic VMIPS shares with this implementation is a
"vector-length register" that stores the active vector length [6]. Additionally, VMIPS
restricts element i of one vector to take part only in operations with element i of
another vector [6], which simplifies implementation. This implementation, discussed in
more detail in later sections, provides a variable number of computation lanes in the
functional unit to maintain single-cycle vector operations, whereas VMIPS provides a
limited number of lanes but pipelines its functional units [6]. Another vector
architecture, "CODE", uses a clustered vector register file rather than a centralized
vector register file [8]. The OTTER Vector Extension follows precedents like VIRAM,
Tarantula, and AltiVec in opting for a single centralized vector register file, and uses
a single integer functional unit for simplicity [8]. This implementation is meant to be
less complex without hindering performance.
With regard to memory, this implementation has variable alignment requirements for
increased flexibility. Predecessors like AltiVec and SSE have strict memory alignment
constraints [7, 2]. The Spert-II vector microprocessor supports unit-stride, strided, and
indexed loads and stores, as does this implementation [11]. This implementation contains
a load/store unit to handle vector memory operations, which is characteristic of several
vector-register architectures such as CRAY-1 and VMIPS [5]. Although this vector
architecture does not currently use off-chip DRAM, pipelined memory operations, or
multiple memory banks, the goal is to allow maximum flexibility for loading and storing
elements of different widths (1, 2, and 4 bytes) and vectors of different lengths.
This vector architecture includes general-purpose instructions for implementing
non-vectorizable code where necessary. Spert-II likewise includes general-purpose
operations to ensure it is "tightly coupled to a fast, general-purpose RISC core" [11].
Most vector architectures only allow fixed-size elements within a vector register
file [11, 2]. This implementation contains a register that keeps track of the current
element width. This vector processor is meant to be an initial step towards building a
more advanced and efficient vector architecture.
Chapter 2
OTTER RISC-V VECTOR EXTENSION OVERVIEW
RISC-V is an open-source reduced instruction set computer (RISC) instruction set
architecture (ISA) [9]. Cal Poly has its own implementation of a RISC-V 32-bit integer
hardware architecture called the OTTER. The OTTER implements many of the
core RISC-V instructions. The OTTER vector extension introduced in this paper is
another set of instructions meant to work alongside the existing core instructions. The
vector extension implemented is based on the specification by RISC-V International
[9]. This implementation includes only a subset of the proposed vector instructions.
The purpose of this extension is to combine the benefits of vectors with the appeal
of the RISC-V architecture. A full listing of the OTTER vector extension instructions is
located in Appendix A of this document.
2.1 Vector Terminology
The "RISC-V 'V' Vector Extension" specification introduces vector terminology that is
crucial to understanding this vector extension. Vector length (VL) refers to the
number of elements stored in each vector. This value is stored in a control
and status register (CSR). An element is a single value within a vector. Elements can
have widths of 1 byte, 2 bytes, or 4 bytes. The term standard element width (SEW)
refers to the current width of elements in a vector. SEW occupies bits [4:2] of the
vtype CSR. SEW is encoded according to Table 2.1 below [9].
Table 2.1: SEW vtype encoding
Chapter 3
OTTER VECTOR EXTENSION IMPLEMENTATION
A diagram with the full implementation of the OTTER with vector extension is
featured in Appendix B of this document.
3.1 Vector Configuration
The vector length and standard element width (SEW) CSRs determine the operating
state of the vector extension. These values can be updated using the vsetvl and
vsetvli instructions. The length of a vector in elements is limited by the width of a
vector register and the SEW. For example, if the vector width is 128 bits in the vector
register file and the SEW is 8 bits, the maximum possible vector length is 128/8,
or 16, elements. The user can alter the width of the vectors in the vector register
file to suit their application needs. The width of the vector length CSR may need to be
increased to support larger vector widths. Operating outside of these conditions
produces undefined behavior. The vector length and SEW CSRs are shown in Figure 3.1
below.
Figure 3.1: Vector length and SEW CSRs
3.2 Vector Register File
The vector register file (VRF) contains 32 vector registers of a user-defined width.
The least significant element, index zero, is stored at bits 0 through SEW-1. Elements
are stored contiguously in a vector register from least significant to most significant.
For memory operations, a vector in the VRF is written element by element. For vector
algorithmic operations, entire vectors are written at once. An example
layout for a vector with a length of 8 and a SEW of 16 stored in the vector register
file is shown in Figure 3.2 below.
Figure 3.2: Vector layout in VRF
Figure 3.3: VRF
A black-box diagram of the VRF is shown in Figure 3.3. The VRF performs synchronous
vector-element and whole-vector writes. Element writes are controlled by the
"vElementWrite" signal, and whole-vector writes are controlled by the "vregWrite"
signal. Data for whole-vector writes is supplied by the "WholeVector" input. Likewise,
"ElementData" supplies a single element to write. The "vidx h" input tells the VRF the
highest bit position to which the single element is written; in effect, it selects which
index inside the vector specified by "WriteAddr" receives the element. The "WriteAddr"
input specifies which of the 32 available vectors will be overwritten. VRF reads are
asynchronous. The "vm" output is always the vector mask at v0. The "vs1" and "vs2"
outputs are the vectors specified by "VecRead1" and "VecRead2", respectively.
3.2.1 Vector Masking
Some vector operations have a masked form. The vector mask is stored in vector
register 0, v0. When a masked operation is performed, the operation is computed only
for elements whose corresponding mask element has a value of 1. For VALU operations,
elements whose mask element is 0 retain the value of the source operand vs2, if vs2 is
specifiable. For the masked vmv instruction, all elements whose mask element is 0 are
set to 0 in the destination vector. The vector mask can be written using vector load
instructions. A possible vector mask for a vector with a length of 8 and a SEW of 2
bytes is shown in Figure 3.4:
Figure 3.4: Example of a vector mask stored at v0
3.3 Vector Loads and Stores
This implementation includes three types of vector loads and stores: unit-stride,
strided, and indexed. Unit-stride loads and stores assume the vector elements are
stored contiguously in memory as an array of SEW-width elements. Strided loads
and stores assume that consecutive vector elements are separated in memory by a fixed
byte stride. Indexed loads and stores use another vector register as a list
of byte-wise indexes into memory where each corresponding vector element is stored.
A few different components were required to implement all three types of loads and
stores with variable-length and variable-element-width vectors. These components are
primarily encapsulated within the vector memory assist unit, described in the next
section.
3.3.1 Vector Memory Assist
The vector memory assist is essentially an extension of the control unit from the
multi-cycle OTTER implementation. In addition to sending controls to the
memory, it sends controls to the vector register file, an internal element counter, an
external offset counter, and external element-select multiplexers. A state diagram
for the vector memory assist unit is shown in Figure 3.5. The vector memory assist
sends the controls necessary to perform vector loads and stores. Because vectors
are variable length, an internal counter keeps track of the current element being
loaded or stored. A load requires a write-back state in which the element data loaded
from memory is written to its specified place in the vector register file. The vector
load/store process is started by a signal from the control unit. Once started, the
vector memory assist asserts a hold signal to tell the control unit that a load or
store is in progress. The hold signal is de-asserted in the last state to tell the
control unit to move on to the next instruction.
Figure 3.5: Vector Memory Assist Unit
Load and store offsets are calculated by an external offset counter. This counter loads
the base address from a general-purpose register and then adds either the
unit stride or the byte stride for each successive element. The enable and load signals
for this counter are controlled by the vector memory assist unit. To compute an indexed
offset, a vector-element select multiplexer uses the "VIDX H" output from the vector
memory assist unit to choose an element from the index vector. This element value is
added to the base address from the general-purpose register and used as the indexed
address into memory. Another multiplexer chooses whether to pass the strided or the
indexed memory address to the memory module. Additionally, the "VIDX H" signal selects
which vector element is stored to memory, or which vector element is written with a
value loaded from memory. These components are shown in the diagram of the OTTER with
Vector Extension in Appendix B.
Although this implementation requires at least a single clock cycle for each vector
element, it is highly flexible for vectors of different lengths and SEWs. It interfaces
directly with the existing OTTER memory module and control unit, and is fairly
simple.
3.4 Vector Algorithmic Logic Unit
The vector algorithmic logic unit (VALU) implements vector addition, subtraction,
bitwise AND, bitwise OR, bitwise XOR, move, and element-wise barrel shifts. Vector
addition, subtraction, move, AND, OR, and XOR have three forms: vector-vector,
vector-register, and vector-immediate. The vector-vector form performs the
operation on corresponding elements of two vectors and stores the result in
the corresponding index of the destination vector. The vector-register form performs
the operation on each element of a vector with the value contained in a general-purpose
register, and stores the result in the corresponding element of the destination
vector. The vector-immediate form is much like the vector-register form, except that the
value used in the operation is a 5-bit immediate. For example, a vector-vector addition
of vectors A and B takes the element at index 0 of A and the element at index 0 of B,
adds them, and stores the result at index 0 of a destination vector C. This is repeated
for the remaining elements. The bits that make up each element, including index 0, are
determined by the value in the SEW CSR. A vector-register addition of vector A and
register X takes the element at index 0 of A, adds the value in register X, stores the
result at index 0 of vector C, and repeats with the same register value for the
remaining elements in the vector. Vector operations truncate the results of all
operations to SEW width. The implementation of the vector ALU allows each of these
operations to complete within a single 20 ns clock cycle while also accounting for
variability in SEW and vector length. All vector computational instructions have masked
forms except for the barrel shifts. Refer to the previous discussion of vector masking
for a more detailed description of this behavior.
Chapter 4
RESULTS & ANALYSIS
To assess the performance of the OTTER Vector Extension, programs with identical
behavior were written with and without the vector extension instructions. The non-
vector programs were written using instructions from the RISC-V base 32-bit integer
instruction set and run on the baseline multi-cycle OTTER. The timing results from
running all programs were acquired using a Vivado simulation since access to FPGAs
was limited.
The first assembly program compared adds the values of two arrays both 16 bytes in
length, and stores the resulting values in a third array. Behavior is more accurately
described by the snippet of C code below:
uint8_t a[16] = {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31};
uint8_t b[16] = {2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32};
uint8_t c[16];
for (int i = 0; i < 16; i++)
    c[i] = a[i] + b[i];
The OTTER Vector Extension implementation of this program sets the vector length
to 16 and the SEW to 8 bits. It loads each vector from memory, performs an addition,
and stores the resulting vector back to memory.
.data
vector1: .byte 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
vector2: .byte 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
vector3: .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.global main
.type main, @function
main:
    la x5, vector1          # address of first source array
    la x6, vector2          # address of second source array
    la x7, vector3          # address of result array
    li x4, 16               # requested vector length
    vsetvli x0, x4, e8      # VL = 16, SEW = 8 bits
    vlb.v v1, (x5)          # unit-stride byte load of vector1
    vlb.v v2, (x6)          # unit-stride byte load of vector2
    vadd.vv v3, v1, v2      # element-wise add
    vsb.v v3, (x7)          # unit-stride byte store of the result
    j main
The non-vector version of this program is less straightforward. It must load a byte at
a time from memory from each array, perform the addition, and store the byte. After
this the addresses must be adjusted so the next byte of each array can be loaded
from/stored to. It also uses a loop to repeat this operation 16 times.
.data
vector1: .byte 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
vector2: .byte 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
vector3: .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.global main
.type main, @function
main:
    la x5, vector1          # address of first source array
    la x6, vector2          # address of second source array
    la x7, vector3          # address of result array
    li x1, 16               # loop counter: 16 bytes to process
loop:
    lb x9, 0(x5)            # load one byte from each source array
    lb x10, 0(x6)
    add x2, x9, x10         # add the two bytes
    sb x2, 0(x7)            # store the low byte of the sum
    addi x5, x5, 1          # advance all three pointers
    addi x6, x6, 1
    addi x7, x7, 1
    addi x1, x1, -1         # decrement the loop counter
    bnez x1, loop
    j main
The OTTER Vector Extension program completed in 115 clock cycles, including the initial
setup of the stack pointer not shown in the code. The non-vector version took 366 clock
cycles, including the same initial setup sequence. Results are summarized in Table 4.1.
The vectorized version produced a speedup of about 3.2 (366/115). One of the primary
reasons for the large speedup is that the addition in the vectorized version takes
only 1 clock cycle to complete, whereas the non-vectorized version requires 16 separate
additions, each with its own fetch cycle. However, the speedup is much
less significant for the loading and storing of the vectors. The vectorized version still
requires at least 2 clock cycles for every element loaded from memory. In this case
two 16-element vectors are loaded, so at least 64 of the clock cycles are spent loading
the vectors from memory, and another 16 are spent storing the resulting vector back to
memory. There is still a slight speedup, because the element accesses of a vector load
occur back-to-back without the need for a fetch cycle in between.
Table 4.1: Array Addition Speedup Results
All of the vector ALU operations have a single-cycle execution time. Thus, this
example could be repeated for subtraction, bitwise-AND, bitwise-OR, bitwise-XOR,
and move, with similar, if not identical, speedup results. Although the example
shown is small, it encompasses a sequence of operations used widely across many
different applications: load, compute, store. As long as loop iterations are
independent, as they are in this example, vectors can be employed to increase
performance.
A second example program essentially implements the memcpy function in C. memcpy
copies bytes from one location in memory to another location. The vectorized
implementation is shown below:
.data
source: .byte 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
        .byte 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
dest:   .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.global main
.type main, @function
main:
    la x5, source           # source pointer
    la x6, dest             # destination pointer
    li x4, 32               # total bytes to copy
    li x3, 16               # requested vector length
    vsetvli x1, x3, e8      # x1 = VL (16), SEW = 8 bits
loop:
    vlb.v v1, (x5)          # load VL bytes from the source
    add x5, x5, x1          # advance the source pointer by VL
    sub x4, x4, x1          # VL fewer bytes remain
    vsb.v v1, (x6)          # store VL bytes to the destination
    add x6, x6, x1          # advance the destination pointer by VL
    bnez x4, loop
    j main
This implementation completed in 142 clock cycles. The non-vectorized implementation
below completed in 427 clock cycles, a speedup of about 3 (427/142). These results
are summarized in Table 4.2. The reason for the speedup is that a vector instruction
completes 16 loads back-to-back without fetching another instruction. The same is true
for the stores. Additionally, the vectorized loop runs only twice, whereas the
non-vectorized loop must run once for each byte. So, although the same number of loads
and stores occur, using vectors eliminates much of the flow-control overhead.
.data
source: .byte 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
        .byte 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32
dest:   .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.global main
.type main, @function
main:
    la x5, source           # source pointer
    la x6, dest             # destination pointer
    li x1, 32               # loop counter: 32 bytes to copy
loop:
    lb x9, 0(x5)            # load one byte
    sb x9, 0(x6)            # store it to the destination
    addi x5, x5, 1          # advance both pointers
    addi x6, x6, 1
    addi x1, x1, -1         # decrement the loop counter
    bnez x1, loop
    j main
Table 4.2: memcpy Speedup Results
The examples shown in this section operated on single-byte elements, but can also be
implemented using half-word and word size elements.
Overall, adding vectors resulted in significant speedup for certain operations. The
most significant speedup will be for computationally intensive programs - lots of
vector operations and minimal loads and stores - but there was still a performance
increase for load-store intensive programs. Ideally, vectors can be used for efficiency
in cryptography or image processing programs. This implementation was specifically
written for implementing a RISC-V version of the Advanced Encryption Standard
(AES).
Chapter 5
FUTURE WORK
This implementation will hopefully act as a starting point for the OTTER vector
extension and other RISC-V projects. However, a number of improvements could
be made to increase the functionality and efficiency of this version of the
OTTER.
Although the VALU provides adequate performance increases, loading and storing
vectors efficiently while maintaining the length and SEW flexibility is a problem yet to
be solved. Memory bandwidth is often a limiting factor for vector processing, and
providing such large bandwidth for memory accesses is a major issue in vector
supercomputers [5]. Possible improvements include redesigning the memory module to allow
pipelined memory reads and writes. To overcome memory latency, many vector processors
spread memory accesses across different memory banks [5]. Another design approach
could allow entire vectors to be read from or written to memory at once for unit-stride
and strided loads and stores, in order to minimize the cycles spent on loads and stores.
The memory module used in this implementation was designed specifically
for the BASYS3 Artix-7 FPGA, but using a different FPGA or external DRAM
with caches would provide a more usable processor. DRAM is typically used
for main memory because it is inexpensive [6].
Other improvements could include implementing more instructions from the
RISC-V V Extension specification. Fixed-point and floating-point instructions may
prove necessary for a range of applications. There are also vector widening and
reduction operations. In particular, exception handling would be practical for detecting
errors. Most vector architectures also contain multiple fully pipelined functional units
for increased performance [11, 6].
It is also worth noting that this implementation is based on the current version
(0.7.x) of the "RISC-V 'V' Vector Extension" specification. The 'V' extension has
not yet been ratified and stabilized, so this implementation will need to be updated
with each new release until the extension has been finalized.
Support does not yet exist for vectorization of C code into RISC-V instructions, but
may become available in the near future with LLVM or gcc. Further research into
this could make this extension more usable and appealing. The use of vectorization
technology to translate code written in high-level languages to vector instructions
may expand use cases for this extension [2].
For higher performance, adding vectors to the pipe-lined OTTER is another step
forward. In addition, a multi-core pipe-lined OTTER with vector capability would
provide both thread-level and data-level parallelism. Clearly, adding vectors to the
OTTER is a small start in a rather long, but optimistically worthwhile journey.
BIBLIOGRAPHY
[1] Cal Poly Github. http://www.github.com/CalPoly.
[2] A. E. Eichenberger, P. Wu, and K. O’Brien. Vectorization for SIMD
Architectures with Alignment Constraints. SIGPLAN Not., 39(6):82–93,
June 2004.
[3] A. K. Eman Aldakheel, Ganesh Chandrasekaran. Vector Processors, 2012.
https://www.cs.uic.edu/~ajayk/c566/VectorProcessors.pdf.
[4] S. Greengard. Will RISC-V Revolutionize Computing? Communications of the
ACM, 63(5):30–32, 2020.
https://cacm.acm.org/magazines/2020/5/244325-will-risc-v-
revolutionize-computing/fulltext.
[5] J. L. Hennessy and D. A. Patterson. Innovation and intellectual property
rights. In T. Green and N. McFadden, editors, Computer Architecture: A
Quantitative Approach, pages G2–G34. Elsevier, Inc, 225 Wyman Street,
Waltham, MA 02451, USA, 2012.
[6] J. L. Hennessy and D. A. Patterson. Innovation and intellectual property
rights. In T. Green and N. McFadden, editors, Computer Architecture: A
Quantitative Approach, chapter 4, pages 261–342. Elsevier, Inc, 225 Wyman
Street, Waltham, MA 02451, USA, 2012.
[7] R. Jr and V. Adve. Vector LLVA: A Virtual Vector Instruction Set for Media
Processing. volume 2006, pages 46–56, 06 2006.
[8] C. Kozyrakis and D. Patterson. Overcoming the Limitations of Conventional
Vector Processors. 2003.
http://iram.cs.berkeley.edu/papers/code.kozyraki.isca.2003.pdf.
[9] RISC-V ”V” Vector Extension, 2019.
https://github.com/riscv/riscv-v-spec/releases/tag/0.7.1.
[10] Vectorization: SIMD Parallelism.
https://cvw.cac.cornell.edu/vector/overview_simd.
[11] J. Wawrzynek, K. Asanovic, B. Kingsbury, D. Johnson, J. Beck, and
N. Morgan. Spert-II: A Vector Microprocessor System. Computer,
29(3):79–86, 1996.
APPENDICES
Appendix A
Table A.1: OTTER Vector Extension Instructions
Appendix B
Figure B.1: OTTER with Vector Extension
