The Design of a Custom 32-Bit SIMD Enhanced Digital Signal Processor by Simha, Shashank








This Master's Project is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for
inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Simha, Shashank, "The Design of a Custom 32-Bit SIMD Enhanced Digital Signal Processor" (2017). Thesis. Rochester Institute of
Technology. Accessed from




Submitted in partial fulfillment




Mr. Mark A. Indovina, Lecturer
Graduate Research Advisor, Department of Electrical and Microelectronic Engineering
Dr. Sohail A. Dianat, Professor
Department Head, Department of Electrical and Microelectronic Engineering
Department of Electrical and Microelectronic Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
December 2017
To my family and friends, for all of their endless love, support, and encouragement
throughout my career at Rochester Institute of Technology
Declaration
I hereby declare that except where specific reference is made to the work of others, the
contents of this paper are original and have not been submitted in whole or in part for
consideration for any other degree or qualification in this, or any other University. This
paper is the result of my own work and includes nothing which is the outcome of work




I would like to thank my advisor, professor, and mentor, Mark A. Indovina, for all of his
guidance throughout the entirety of this project. The continuous feedback and motivation
he provided have been a major driving force pushing me beyond my limits throughout
my career at RIT, for which I am truly grateful. His passion for teaching and expertise in
digital design, along with decades of industrial experience, have established him as my role
model in the field. His advice, teaching methods, management and cross-domain knowledge
have been a huge inspiration for me to pursue a career in VLSI and digital design.
I would like to thank Dr. Dorin Patru and Dr. Marcin Lukowiak for providing me with
valuable knowledge and feedback on computer architecture and FPGAs, which
gave me a firm foundation in these topics.
I would like to thank my parents for their continuous support throughout my career at
RIT, for believing in me, and for being my biggest role models. They have always been my pillars
of support and great motivators throughout my life, at and away from home.
I would also like to thank my roommates for being my brothers throughout the two
years of graduate school.
Finally, I would like to thank all my classmates and TAs for their invaluable guidance
and support throughout my entire career at RIT.
Abstract
For a number of years, the hardware industry has seen a drastic rise in embedded
applications. Thanks to the Internet of Things (IoT) revolution, a majority of these embedded
applications are shifting towards simple hardware that can run on
batteries while still handling complex data and implementing complex algorithms.
In digital design terms, the hardware is expected to be
power efficient, small and simple, while still meeting real-time
constraints and executing mathematical algorithms. Most modern
DSPs, by contrast, target high performance and a wide range of applications, which usually
results in higher power consumption and more complex hardware.
The main motivation of this paper was to implement a simple DSP design, optimized for
power efficiency, that is capable of handling simple multimedia applications. Hence,
an enhanced version of the TMS32010 DSP is implemented with numerous modifications to
the architecture, ISA, memory addressing and pipeline structure. The major enhancements
include the addition of instruction level parallelism using SIMD instructions, use of a much
larger data memory to accommodate the larger amount of data in multimedia
applications, and expansion of the data word to 32 bits to support packed SIMD
data and fully utilize the 32-bit ALU. The ISA, pipeline and memory access enhancements






List of Figures
List of Tables
1 Introduction
1.1 DSP classifications
1.2 History of DSPs
1.3 Brief introduction to the DSP design and paper organization
2 DSP architecture
2.1 Top level block diagram
2.2 Internal blocks
2.2.1 Address decode unit
2.2.2 Execution unit
2.2.3 ALU
3 Instruction Set Architecture of the DSP
3.1 Instruction and data word expansion
3.2 Addressing modes
3.2.1 Direct addressing
3.2.2 Indirect addressing
3.3 Instruction opcodes and operation
3.3.1 List of instructions and corresponding opcodes
3.3.2 Description of the operation of each instruction
4 DSP Pipeline and Read/Write RAM buffer wrapper implementation
4.1 Pipeline implementation
4.1.1 Pipeline stages
4.1.2 Pipeline design for non-branching instructions
4.1.3 Pipeline design for unconditional branching instructions
4.1.4 Pipeline design for conditional branching instructions
4.2 Read/write RAM buffer wrapper
4.2.1 RAM read/write problem description
4.2.2 Design and implementation of read/write buffer wrapper
5 Median filter design
5.1 Median filter overview
5.2 Median filter design and implementation
6 Results
6.1 Results
7 Conclusions and future work
7.1 Conclusion
7.2 Future work
References
I Source Code
I.1 RTL source code
I.1.1 DSP top level module
I.1.2 ALU
I.1.3 Input shifter
I.1.4 Output shifter
I.1.5 Compare select unit
I.1.6 Multiplier
I.1.7 Adder
I.2 Assembler designed in Perl
I.3 Assembly source code for testing and median filter
I.3.1 Assembly code used for basic level testing
I.3.2 Assembly code used for median filter algorithm
List of Figures
1.1 Fixed and floating point illustration
2.1 Top-level block diagram
2.2 Address decode unit block diagram
2.3 Execution unit block diagram
2.4 ALU block diagram
3.1 Instruction word expansion for various instructions
3.2 Data word expansion
3.3 Direct addressing illustration
3.4 Indirect addressing illustration
4.1 Pipeline stages and implementation
4.2 Pipeline example for memory read instructions
4.3 Pipeline example for memory write instructions
4.4 Pipeline example for unconditional branching
4.5 Pipeline implementation example for conditional branch instruction, when condition is false
4.6 Pipeline implementation example for conditional branch instruction, when condition is true
4.7 Read/write RAM buffer wrapper state machine
5.1 Median filter working illustration
5.2 Median filter algorithm
5.3 Median filter algorithm implementation illustration for a 3 × 3 window
List of Tables
3.1 List of instructions and their opcodes
3.2 List of instructions and their operations
6.1 Synthesis results
Chapter 1
Introduction
With advances in technology, the amount of data stored and processed has increased
exponentially ever since computers were invented. A major
part of this data is multimedia, which is essentially either audio or image data [1].
To compress, restore, process and understand image data, numerous mathematical
algorithms, usually quite complex, have been implemented in computing. General-purpose
processors were a poor fit for many such applications: many of their
functions were not required by the application, or were used by only a few applications [2].
These processors also took too long to compute mathematically intense algorithms in
real time, which their hardware was simply not built to handle. This market was targeted by
DSPs (Digital Signal Processors). DSPs have historically been used in such applications to
increase the speed of computing by implementing complex hardware and parallel computing
[3].
Figure 1.1: Fixed and floating point illustration
1.1 DSP classifications
DSPs are broadly classified into fixed-point and floating-point architectures. Fixed-point DSPs
are designed to handle signed integer data, while floating-point DSPs are
designed to handle rational number data. The representation of data stored in each of
these DSPs is hence different, which is the major reason behind the classification, since it
directly affects the amount of hardware required for each implementation. Fixed-point
data is represented by the integer's sign in the MSB (Most Significant Bit), followed by its
value in the remaining bits. Floating-point data is represented by the number's
sign in the MSB, followed by its exponent and then its mantissa. Fig. 1.1 illustrates
both representations [4].
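The two layouts can be made concrete with a minimal sketch. It assumes the usual two's-complement encoding for fixed-point integers and the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits) for floating point; neither encoding is named in the text, and the function names are purely illustrative.

```python
import struct


def twos_complement_bits(value: int, width: int = 16) -> str:
    """Bit string of a signed integer, assuming two's-complement encoding."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the requested width")
    return format(value & ((1 << width) - 1), f"0{width}b")


def float32_fields(value: float) -> tuple:
    """Split a value into (sign, exponent, mantissa), assuming IEEE 754 single precision."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    sign = raw >> 31                # MSB
    exponent = (raw >> 23) & 0xFF   # biased exponent, 8 bits
    mantissa = raw & 0x7FFFFF       # fraction, 23 bits
    return sign, exponent, mantissa
```

For example, `twos_complement_bits(-3)` has its MSB set to 1, and `float32_fields(-6.5)` returns a sign of 1, a biased exponent of 129 and a mantissa of 0x500000, since -6.5 = -1.625 × 2^2.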
Generally, fixed-point implementations are faster, cheaper, more power efficient, simpler
to design and verify, and quicker to bring to market. A floating-point implementation
trades off all of these factors for better precision and faster computation of floating-point
data. Hence, most of the time, one of the two architectures is selected mainly based on the
application. It is also worth noting that some DSPs, such as the SHARC DSP by Analog Devices,
are equally efficient at implementing either architecture.
Architecturally, the amount of hardware and design effort required to implement
floating-point precision is much higher than for fixed-point precision. First, the data unit,
along with the memory and registers, must be expanded from 16 bits to at least 32 bits.
The ISA itself also has to be expanded significantly, as almost all floating-point DSPs
support fixed-point operations along with floating-point ones [5].
From a compilation point of view, the C language has built-in floating-point types that
fully exploit the hardware capability of floating-point DSPs. While the C compiler
takes advantage of the floating-point hardware, certain rules are followed to
ensure that the data fits within the 32-bit or 64-bit data word. Fixed-point C compilation is
implemented by mapping integers to fixed-point data. The problem with fixed point, though,
is that there is no ANSI standard for it, so it usually requires additional code
for conversions and shifts. The efficiency of fixed-point compilation suffers further because
fixed-point-specific instructions are not built in [6].
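The "additional code for conversions and shifts" can be illustrated with a short sketch. It assumes the common Q15 format (1 sign bit, 15 fractional bits), which is not named in the text, and uses Python only to show the arithmetic; equivalent C code would need the same explicit scaling and shifting. All helper names are hypothetical.

```python
Q = 15  # number of fractional bits in the assumed Q15 format


def to_q15(x: float) -> int:
    """Convert a real number in [-1, 1) to a saturated Q15 integer."""
    scaled = int(round(x * (1 << Q)))
    return max(-(1 << Q), min((1 << Q) - 1, scaled))


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a real number."""
    return q / (1 << Q)


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values; the Q30 product is shifted back to Q15."""
    return (a * b) >> Q
```

Multiplying 0.5 by -0.25 this way gives `q15_mul(to_q15(0.5), to_q15(-0.25)) == -4096`, which `from_q15` maps back to -0.125. The explicit shift after the multiply is exactly the kind of bookkeeping a floating-point type hides from the programmer.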
1.2 History of DSPs
The main motivation for the DSP was to have powerful hardware with more application-specific
functions and instructions than a general-purpose processor. This is
evident throughout the evolution of DSPs, looking at their various applications over
the past few decades. All early DSPs were fixed-point architectures,
mainly because there was no floating-point standardization until the early 1980s. With
applications ranging from audio systems, speech processing and SONAR to medical imaging
and RADAR, DSPs are used in almost every field today [7]. It is also interesting to note early
applications of DSPs in personal computers, such as the Motorola 56000 used in the Atari
Falcon, NeXT and SGI workstations.
DSPs have been produced by almost all major semiconductor companies, including Intel,
AMD, Texas Instruments, Motorola and Analog Devices, at some point. Most of
the early DSPs targeted audio processing, as in the Speak & Spell by Texas Instruments.
Throughout their evolution, DSPs have grown more and more application specific,
rather than the other way around [5].
Speak & Spell, an early toy used to teach children to spell words, launched in 1978, was
the earliest mass-produced DSP product on the market, powered by the Texas Instruments
TMS5100 DSP [8]. Interestingly, in the late 1970s Intel tried unsuccessfully to enter the DSP
market early with its 2920 analog processor, which failed mainly because it lacked
a true multiplier. The first attempts at DSP devices include the AT&T DSP1 and the NEC
µPD7720. It is worth noting that the DSP1 introduced the historic MAC instruction to the
world; this was one of the earliest steps towards implementing instruction-level parallelism in
DSPs [9].
The first generation of DSPs started appearing on the market in the early 1980s. Key
features of this generation were the Harvard architecture and multiply-add-accumulate
instructions. The TMS32010 by Texas Instruments, from this generation,
was notably one of the most successful DSPs in history, as it pushed Texas
Instruments to become the market leader in DSPs. Because it was based on the Harvard
architecture and had a specialized ISA, it was the fastest DSP at the time [9].
The next generation of DSPs, from the late 1980s to the early 1990s, featured advanced
architectures capable of handling much more complex applications. Motorola entered the
DSP market with its popular fixed-point DSP56000, featuring 24-bit program and data
words. The second generation of DSPs featured further optimization of the memory architecture,
with architectures capable of accessing multiple data memories in a single instruction.
This generation also brought floating-point DSP architectures to market. Examples
include the SHARC series of DSPs by Analog Devices, which called its architecture the Super
Harvard architecture. Interestingly, shrinking fabrication technology also had a huge
impact on this generation of DSPs, as more and more hardware could fit onto the chip
while still keeping it tiny [9].
DSPs of the late 1990s incorporated more application-specific instructions, as they were mostly
used as coprocessors alongside the main CPU. Many DSPs, however, lost market share when
CPUs became SIMD capable. Parallel processing capabilities were subsequently introduced
in later DSPs with Single Instruction Multiple Data (SIMD) and Very Long Instruction Word (VLIW)
instructions. VLIW architectures take advantage of spatial parallelism
along with temporal parallelism, since they use several functional units to concurrently
execute multiple operations while pipelining these functional units [10]. Parallelism was
further boosted by adding multiple cores and threads in later DSPs [9].
Modern DSPs are functionally not very different from those of the late 1990s. Optimizations
in the last decade, though, have been directed toward DSP compilation strategies. Making
DSP architectures more compiler friendly and building better DSP compilers have
been crucial steps in DSP evolution, because the compiler is the bridge to the software world
and DSP applications will increase tremendously as a result. As the software world starts using
DSPs effectively and understanding their capabilities, significant gains in software
productivity could be achieved [11]. Modern DSPs have also been trying to incorporate
as much parallelism as possible, using different approaches. The latest Texas Instruments
DSP, the TMS320C64x, is a very good example of this, as it combines both VLIW and SIMD
capabilities in one architecture. DSPs from LSI Logic Corp. take a different approach
with their superscalar architecture, arguing that a superscalar architecture is more
compiler friendly than the VLIW approach [9].
1.3 Brief introduction to the DSP design and paper organization
The capabilities of DSPs have evidently evolved at a rapid pace, especially since
the 1990s. With the introduction of concepts such as the super Harvard architecture, VLIW and
superscalar architecture, the design complexity of DSPs has risen at the same pace.
Hence, a simple embedded programmable processor that offers not only conventional
instructions but also DSP-specific ones is desirable in some applications [12][13][14][15].
The paper [12] shows one such approach, in which the TMS32010, a fairly simple monolithic
DSP, is implemented on an FPGA platform.
The DSP design presented here is very similar to the TMS32010 by Texas Instruments,
but has major enhancements, which are discussed in later chapters. Looking at some
applications of the TMS32010 was a necessary step in deciding which enhancements could
benefit DSP applications. Numerous applications of the TMS32010 in the area of
image processing involve a multiprocessor system that uses multiple DSPs to implement
a complex algorithm. Paper [16] presents one such application, in which an edge detection
algorithm is implemented using eight TMS32010 DSPs in a multiprocessor configuration
with a parallel image processing architecture. Another interesting image processing
application is presented in [17], where the TMS32010 is interfaced with a host processor
to speed up image processing algorithms. That paper also mentions limitations of the
system, one of which is the lack of data memory in the TMS32010. Addressing this has been
one of the major enhancements in our DSP design.
The following chapters describe the details involved in designing
the DSP. Chapter 2 covers the architecture of the DSP, including the data flow within
the DSP and the types of operations, and compares it with other DSP architectures. Chapter 3
covers instruction and data word expansions, the opcodes of all instructions, addressing
modes and assembler design, in an effort to describe the Instruction Set Architecture (ISA)
of the DSP. Chapter 4 covers the pipeline design and the RAM buffer wrapper design. Chapter
5 elaborates on one application of the DSP, the median filter design, and describes
the merits of the DSP with respect to its implementation. Chapters 6 and 7 present the
results and conclusions of the paper.
Chapter 2
DSP architecture
The architecture of a DSP is very different from that of a general-purpose CPU, as discussed
in the previous chapter. One of the biggest bottlenecks in executing DSP algorithms
is transferring information to and from memory [5]. Features such as the Harvard architecture,
direct memory access (DMA), a multiply-accumulate (MAC) unit and barrel shifting
distinguish DSPs from general-purpose processors. Some DSPs have
general-purpose registers, like the SHARC ADSP-2106x, and others are accumulator based,
such as the TMS32010 [5].
As noted earlier, the primary advantage of a DSP is its speed. This means its
architecture needs to be capable of performing complex mathematical calculations within a
single clock cycle. There are different techniques to achieve this architecturally. The first is
pipelining, which ensures that most instructions complete at a rate of one per clock
cycle, but with a latency between an instruction's input and its output equal to the number
of pipeline stages. While this approach is attractive, one thing to remember is that the
same resource cannot be used in multiple stages. The second technique is parallel execution
of multiple tasks. The important point to remember here is that these tasks must not have
dependencies on each other and, of course, cannot use the same resources. DSP architectures
are usually designed by combining both techniques to achieve the required speed.
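The trade-off between throughput and latency in pipelining can be shown with a small worked calculation. This is a sketch for an ideal, hazard-free pipeline. The five-stage depth used in the example is only an illustrative value, and both function names are hypothetical.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to finish a hazard-free instruction stream on an ideal pipeline.

    The first result appears after `num_stages` cycles (the latency);
    every later instruction completes one cycle after the previous one.
    """
    if num_instructions <= 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction must pass through all stages before the next starts."""
    return max(num_instructions, 0) * num_stages
```

With 100 instructions and five stages, the ideal pipeline needs 104 cycles instead of 500, so the sustained rate approaches one instruction per cycle even though each individual instruction still takes five cycles from input to output.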
Architecture is a DSP's most important feature, and it differs significantly
from that of conventional microprocessors. The basic capability of integrating
a multiplier/accumulator into the data path has proven to be revolutionary in
computing many algorithms. Other factors, such as preserving the precision of the product
after multiplication, having shift capability while storing the accumulator to memory,
and handling overflow, are crucial for a DSP architecture, since most of its applications
are complex arithmetic operations requiring precise calculations [18].
While Chapters 3 and 4 deal with how exactly each instruction is planned and executed
and with the pipeline design of the DSP, respectively, this chapter starts by taking a brief
look at the top-level architecture. Later in the chapter, the lower-level blocks of the
DSP, including the functional blocks of the architecture, are discussed in detail.
2.1 Top level block diagram
Figure 2.1: Top-level block diagram
The top-level block includes the DSP, ROM, RAM and read/write buffer wrapper. Fig.
2.1 shows the top-level view of the DSP system along with an
abstracted high-level view of the interconnections between these components. Note
that all the modules within the DSP are functional blocks, not pipeline stages. The
necessity and usage of the read/write RAM buffer are explained in Chapter 4.
The DSP sends a new program counter (PC) value to the ROM every cycle.
The ROM returns the instruction corresponding to the previous PC to
the DSP. Depending on the instruction, the DSP then sends a read address to
the read/write RAM buffer to fetch an operand from the RAM. The read/write RAM
buffer in turn passes this address to the RAM and fetches the data from
it. This data is sent to the DSP, which executes the instruction and computes the
result. Lastly, depending on the following instruction, this result is saved into the RAM
via the read/write RAM buffer.
2.2 Internal blocks
As discussed in the previous section, the DSP needs to perform the following functions for
every instruction in the same order:
1. Fetch the instruction from ROM,
2. Decode this instruction,
3. Fetch data from RAM if necessary,
4. Execute the instruction, and
5. Save the result back into the RAM if required.
While the pipeline takes care of distributing these functions across multiple clock cycles,
it is necessary to carefully plan the hardware necessary to perform each function.
Section 2.2.1 (address decode unit) describes how the instruction is broken into fields
as soon as the DSP receives it from the ROM. As the logic remains the same for almost all
instructions, most of its implementation is described within that short subsection. Execution,
however, is a more complex task, as it is unique for every instruction, and requires further
planning. Hence, the ALU is discussed separately in Section 2.2.3, following a brief look at
the execution unit in Section 2.2.2.
2.2.1 Address decode unit
Fig. 2.2 shows the block diagram of the address decode unit. The decode unit supports two
addressing modes, namely direct and indirect addressing. As shown in the figure, eight
Auxiliary Registers (ARs) are used for indirect addressing. The AR pointer (ARP)
indicates which AR is to be used. Chapter 3 discusses the addressing modes in detail.
Direct addressing does not require much logic to decode, as the instruction itself
contains almost all the details required to generate the data memory address. Here, the seven
least significant bits (LSBs) of the instruction are simply concatenated with the contents of the
data page pointer (DPP) to generate the data memory address.
With indirect addressing, the instruction specifies the following details: which
AR will be used for the next indirect access, i.e. the next AR pointer (NARP), and
whether the contents of the current AR are to be incremented or decremented by one, or left
unchanged. The address of the data to be fetched is held in the AR itself.
Figure 2.2: Address decode unit block diagram
2.2.2 Execution unit
Fig. 2.3 shows the block diagram of the execution unit. The execution unit is broken
down into three major functional blocks: the ALU, the barrel shifters and the multiplier. Apart
from these, all the other components in the figure facilitate the interconnection
between these blocks, as directed by the instruction word decode logic.
Figure 2.3: Execution unit block diagram
Two 32-bit barrel shifters are used in the DSP. The input shifter shifts one of
the ALU inputs, while the output shifter shifts the result of the ALU. The input
data to the input shifter is read from the RAM. The output shifter, however, is used only
when storing (writing back) the accumulator contents into the RAM.
A 16×16-bit multiplier is used for multiplication operations. One input to the
multiplier always comes from the T-register, while the other input is either read from the
RAM or taken directly from the instruction. The output is always stored in the
P-register.
The DSP is designed such that one of the ALU inputs is always the accumulator, while
the other input is loaded either from the output of the input shifter or from the P-register.
The result of the ALU is always fed back into the accumulator.
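The data flow described above (T-register and operand into the multiplier, product into the P-register, P-register or shifted operand into the ALU, ALU result into the accumulator) can be sketched as a small behavioural model. This is not the thesis RTL. The method names are hypothetical, and signed operands with wrap-around on 32-bit overflow are assumptions made only to keep the sketch short.

```python
def to_signed(value: int, bits: int) -> int:
    """Wrap an integer into a signed two's-complement range of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class ExecutionUnitModel:
    """Hypothetical behavioural model; method names are not the thesis ISA."""

    def __init__(self) -> None:
        self.t = 0    # T-register: fixed multiplier input
        self.p = 0    # P-register: multiplier output
        self.acc = 0  # accumulator: one ALU input and the ALU result

    def load_t(self, operand: int) -> None:
        self.t = to_signed(operand, 16)

    def multiply(self, operand: int) -> None:
        # 16x16 multiply (assumed signed); the 32-bit product lands in P.
        self.p = to_signed(self.t * to_signed(operand, 16), 32)

    def add_product(self) -> None:
        # ALU path: accumulator + P-register -> accumulator.
        self.acc = to_signed(self.acc + self.p, 32)

    def add_shifted(self, operand: int, shift: int = 0) -> None:
        # ALU path: accumulator + input-shifter output -> accumulator.
        shifted = to_signed(operand, 32) << shift
        self.acc = to_signed(self.acc + shifted, 32)


def dot_product(xs, ys) -> int:
    """Multiply-accumulate loop built from the three register transfers."""
    unit = ExecutionUnitModel()
    for x, y in zip(xs, ys):
        unit.load_t(x)
        unit.multiply(y)
        unit.add_product()
    return unit.acc
```

A call such as `dot_product([1, 2, 3], [4, 5, 6])` returns 32, with every partial product passing through the P-register before it reaches the accumulator.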
2.2.3 ALU
To accommodate SIMD instructions in the ISA, the ALU is optimized for SIMD add/subtract
instructions. Fig. 2.4 shows the block diagram of the ALU. The main goal while designing
the ALU was to add minimal hardware to the TMS32010 design while also supporting
SIMD instructions.
Four 8-bit adders are used to implement 32-bit addition and subtraction,
as well as 8-bit SIMD addition and subtraction. The only internal change to
the adders is the addition of multiplexers on the carry signals between consecutive
adders, which force the carries to zero for a SIMD add. For subtraction, the 32-bit
2's complement of the operand is fed to the adders in the 32-bit case, while
a separate 2's complement is fed to each adder for its own 8 bits in the SIMD
case.
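The carry-gating idea can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the thesis RTL: the function names are hypothetical, lane 0 is taken to be the least significant byte, and overflow in each lane simply wraps.

```python
def add8(a: int, b: int, carry_in: int) -> tuple:
    """One 8-bit adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def alu_add(x: int, y: int, simd: bool) -> int:
    """32-bit add built from four 8-bit adders with gated carries."""
    result = 0
    carry = 0
    for lane in range(4):  # lane 0 = least significant byte (assumed)
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        s, carry_out = add8(a, b, carry)
        result |= s << (8 * lane)
        # Multiplexer between adders: pass the carry on, or force it to zero.
        carry = 0 if simd else carry_out
    return result


def negate(y: int, simd: bool) -> int:
    """2's complement of the whole word, or of each 8-bit lane separately."""
    if not simd:
        return (-y) & 0xFFFFFFFF
    out = 0
    for lane in range(4):
        byte = (y >> (8 * lane)) & 0xFF
        out |= ((-byte) & 0xFF) << (8 * lane)
    return out


def alu_sub(x: int, y: int, simd: bool) -> int:
    """Subtract by adding the appropriate 2's complement."""
    return alu_add(x, negate(y, simd), simd)
```

Adding 0x000000FF and 0x00000001 gives 0x00000100 in 32-bit mode but 0x00000000 in SIMD mode, because the carry out of the lowest lane is blocked. Only the multiplexers and the negation step differ between the two modes; the four adders themselves are shared.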
Figure 2.4: ALU block diagram
Chapter 3
Instruction Set Architecture of the DSP
One of the biggest challenges in designing a DSP is striking a fair balance between
the hardware complexity of implementing a particular ISA (Instruction Set Architecture)
and the possible applications of each instruction. For the past few decades, major DSP
manufacturers have been experimenting with different instructions and ISAs to maximize
the applications of their DSPs across various fields. While a complex ISA with
hundreds of instructions may look like a clear winner, a large number of DSPs
are used in embedded and real-time applications where power, size and cost are extremely
important. Interestingly, with the advent of the Internet of Things (IoT), there has been an
exponential increase in such applications. Here, ISA complexity needs to be traded off for
flexibility and robustness.
Another very important factor to consider is the type of addressing, that is, how easy and
flexible the address conversion logic is for a user. This is even more crucial in DSPs,
as almost all DSPs fetch data from the RAM for each ALU operation, unlike CPUs, where
numerous general-purpose registers are used as ALU operands. Moreover,
most DSP operations are ALU operations [19].
The proposed DSP architecture is very similar to the TMS32010 in its ISA. In the next
few sections, the instruction and data words of the DSP, the addressing modes and the
instruction opcodes along with their operations are briefly described.
3.1 Instruction and data word expansion
To fit every instruction and its corresponding data
within a 16-bit instruction word, five types of instruction word are defined. Fig. 3.1
shows the expansions of these instruction words. Note that the expansions shown
in the figure are for direct addressing mode only. In the case of indirect addressing,
the last 7 bits are used differently: bits 0, 1 and 2 indicate the value of the next
AR pointer (NARP), while bits 5 and 6 indicate the post-increment/decrement
operation for the current AR.
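The indirect-addressing fields can be illustrated with a short decoding sketch. Several details here are assumptions rather than facts from the text: which of bits 5 and 6 means increment and which means decrement, that the ARP is updated to NARP only after the current access, and that addresses are 15 bits wide (an 8-bit page pointer plus a 7-bit offset). The class and method names are hypothetical.

```python
ADDR_BITS = 15                      # assumed: 8-bit page + 7-bit offset
ADDR_MASK = (1 << ADDR_BITS) - 1


class AddressUnit:
    """Hypothetical model of the eight auxiliary registers and the AR pointer."""

    def __init__(self) -> None:
        self.ar = [0] * 8   # auxiliary registers AR0..AR7
        self.arp = 0        # AR pointer: selects the current AR

    def indirect_access(self, instruction: int) -> int:
        """Return the data address and apply the post-update fields."""
        narp = instruction & 0b111            # bits 0-2: next AR pointer
        increment = (instruction >> 5) & 1    # assumed meaning of bit 5
        decrement = (instruction >> 6) & 1    # assumed meaning of bit 6

        address = self.ar[self.arp]           # the current AR holds the address
        self.ar[self.arp] = (address + increment - decrement) & ADDR_MASK
        self.arp = narp                       # takes effect for the next access
        return address
```

Under these assumptions, stepping through an array takes one instruction per element: each access reads through the current AR, post-increments it, and keeps NARP equal to the current ARP.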
Data words can be stored in four different formats, depending on the type of instruction
used to handle them. Fig. 3.2 shows all variants of the data word, where D0, D1, D2 and D3
are 8-bit signed/unsigned integers. While regular ALU instructions operate on 32-bit
data words, SIMD instructions use the 8-bit variants.
3.2 Addressing modes
As indicated while describing the decode unit in Chapter 2, two addressing modes are
implemented in the DSP. These addressing modes function exactly like those of the TMS32010,
except that eight auxiliary registers are used here instead of two. Also, since the DSP
design contains a much larger RAM, the data-page pointer has been expanded to 8 bits from
the single- or double-bit versions of the TMS32010 [20].
Figure 3.1: Instruction word expansion for various instructions
Figure 3.2: Data word expansion
Since multiplication is the only instruction that supports immediate addressing, the DSP
does not support immediate addressing for any other operation, including ALU operations.
Immediate addressing is therefore not claimed as a supported addressing mode of the DSP.
3.2.1 Direct addressing
Though direct addressing seems fairly straightforward to implement, it involves a two-step
process. In the first step, the data-page pointer register is loaded with the 8 most
significant bits (MSBs) of the address, using a separate load data-page pointer
instruction. The second step is to specify the remaining 7 bits of the address in the 7
least significant bits (LSBs) of the instruction, while resetting the eighth bit of the
instruction to indicate direct addressing mode.
Hence, the 8 MSBs of the RAM address are considered the page. In other words, the RAM has
256 pages, each consisting of 128 words. This means that once the page address is loaded,
any data within the page can be accessed in a single instruction. The main drawback of
this system is that every time a different page needs to be accessed, an additional
instruction must be used to load the page address. The advantage is that the instruction
word remains small, and hence the power efficiency is high compared to a bigger
instruction word. Fig. 3.3 illustrates the working of direct addressing.
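The address formation described above amounts to concatenating the 8-bit data-page pointer with the 7-bit offset from the instruction word. The following sketch illustrates this; the function and constant names are hypothetical.

```python
PAGE_BITS = 8    # width of the data-page pointer
OFFSET_BITS = 7  # address bits carried in the instruction word


def direct_address(data_page: int, offset: int) -> int:
    """Concatenate the page pointer and the 7-bit offset into a RAM address."""
    if not 0 <= data_page < (1 << PAGE_BITS):
        raise ValueError("data page must fit in 8 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in 7 bits")
    return (data_page << OFFSET_BITS) | offset


print(direct_address(0, 5))      # 5: word 5 of page 0
print(direct_address(1, 0))      # 128: first word of page 1
print(direct_address(255, 127))  # 32767: last word of the last page
print((1 << PAGE_BITS) * (1 << OFFSET_BITS))  # 32768 addressable words
```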
Figure 3.3: Direct addressing illustration
3.2.2 Indirect addressing
Indirect addressing is a multi-step process, since it accomplishes two things: selecting the
next auxiliary register (AR) and incrementing/decrementing the current AR. The ARs
hold address locations for the RAM data to be fetched, hence it is required to load them
prior to using indirect addressing instructions.
The first step is to load one or more ARs with the desired address locations. Next, the
AR pointer (ARP) is set to the AR containing the next address to be accessed. The value
of this AR can then be incremented or decremented, so that it is ready the next time the
same AR is accessed.
Comparing indirect and direct addressing, indirect addressing is an attractive choice
if the same data or its immediate neighbor is accessed multiple times. Since indirect
addressing happens via the ARs, similar to direct addressing the ARs initially need to
be loaded with address locations that are to be accessed. The advantage here is that
these registers can be incremented/decremented every cycle, while also having the option
of selecting which AR is to be accessed for the next operation.
Expanding the number of these registers is hence very helpful in numerous applications
where consecutive data in multiple locations are required by the algorithm. The best
examples of such applications are image filters, where the pixels within a 3x3 window are
used by numerous algorithms. Fig. 3.4 illustrates the working of indirect addressing.
Figure 3.4: Indirect addressing illustration
3.3 Instruction opcodes and operation
The instruction set consists of a total of 50 instructions, including load/store, branching,
data manipulation and SIMD instructions. Some instructions from the TMS32010 that
are not implemented here are the table read and write, LTD, IN and OUT instructions.
Sections 3.3.1 and 3.3.2 list all the instructions and briefly describe their operations,
respectively.
3.3.1 List of instructions and corresponding opcodes
Table 3.1 shows the opcodes for all instructions. For instructions with an addressing-mode
bit M, the first row gives the direct addressing form of bits D7 to D0 and the second row
gives the indirect addressing form. Although it is not explicitly indicated in the table,
every two-word branching instruction is followed by the jump address in the next
instruction word.
Sl no.  Instruction  Words  Cycles  D15 ... D8          D7 ... D0
1       MPY          1      1       0 0 0 0 0 0 0 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
2       MPYK         1      1       1 0 1 0 0 0 0 0     K K K K K K K K
3       MAC          1      1       0 0 0 0 0 0 1 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
4       OR           1      1       0 0 0 0 0 0 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
5       XOR          1      1       0 0 0 0 0 1 0 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
6       SPAC         1      1       0 0 0 0 0 0 0 1     0 0 0 0 0 0 0 0
7       SUB          1      1       0 0 1 S S S S S     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
8       SUBS         1      1       0 0 0 0 0 1 0 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
9       ADD          1      1       0 1 0 S S S S S     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
10      ADDS         1      1       0 0 0 0 0 1 1 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
11      AND          1      1       0 0 0 0 0 1 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
12      BU           2      2       0 0 0 0 0 0 0 1     0 0 0 0 1 1 1 1
13      BANZ         2      2       0 0 0 1 1 0 0 0     M 0 0 0 0 0 0 0
                                                        M * * 0 0 AR2 AR1 AR0
14      BGEZ         2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 0 0 1
15      BGZ          2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 0 1 0
16      BLEZ         2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 0 1 1
17      BLZ          2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 1 0 0
18      BNZ          2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 1 0 1
19      BV           2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 1 1 0
20      BZ           2      2       0 0 0 0 0 0 0 1     0 0 0 0 0 1 1 1
21      LAC          1      1       0 1 1 S S S S S     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
22      LACK         1      1       1 0 1 0 0 0 0 1     K K K K K K K K
23      LAR          1      1       1 1 0 0 0 AR AR AR  M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
24      LARK         1      1       1 1 0 0 1 AR AR AR  K K K K K K K K
25      LARP         1      1       1 0 1 0 0 0 1 1     0 0 0 0 0 K K K
26      LDP          1      1       0 0 0 0 1 0 0 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
27      LDPK         1      1       1 0 1 0 0 1 0 0     K K K K K K K K
28      LT           1      1       0 0 0 0 1 0 1 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
29      LTA          1      1       0 0 0 0 1 0 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
30      LTP          1      1       0 0 0 0 1 1 0 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
31      LTS          1      1       0 0 0 0 1 1 1 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
32      MAR          1      1       0 0 0 0 1 1 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
33      PAC          1      1       0 0 0 0 0 0 0 1     0 0 0 1 1 1 1 1
34      ROVM         1      1       0 0 0 0 0 0 0 1     0 0 1 0 1 1 1 1
35      SAC          1      1       1 0 0 S S S S S     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
36      SAR          1      1       1 1 0 1 0 AR AR AR  M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
37      SOVM         1      1       0 0 0 0 0 0 0 1     0 0 1 1 1 1 1 1
38      NOP          1      1       0 0 0 0 0 0 0 1     0 1 0 0 1 1 1 1
39      ZAC          1      1       0 0 0 0 0 0 0 1     0 1 0 1 1 1 1 1
40      ZALH         1      1       0 0 0 1 0 0 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
41      ZALS         1      1       0 0 0 1 0 1 0 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
42      APAC         1      1       0 0 0 0 0 0 0 1     0 1 1 0 1 1 1 1
43      CMPSIMD      1      1       0 0 0 1 0 1 0 1     M - - G/L - - - -
                                                        M * * G/L 0 AR2 AR1 AR0
44      SUBSIMD      1      1       0 0 0 1 0 1 1 0     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
45      ADDSIMD      1      1       0 0 0 1 0 1 1 1     M D D D D D D D
                                                        M * * 0 0 AR2 AR1 AR0
46      POP          1      1       0 0 0 0 0 0 0 1     0 1 1 1 1 1 1 1
47      PUSH         1      1       0 0 0 0 0 0 0 1     1 0 0 0 1 1 1 1
48      RET          1      2       0 0 0 0 0 0 0 1     1 0 0 1 1 1 1 1
49      CALL         2      2       0 0 0 0 0 0 0 1     1 0 1 0 1 1 1 1
3.3.2 Description of the operation of each instruction
The operation of all instructions is listed in Table 3.2. It can be observed from the table
that each instruction has four stages of implementation; these are the four pipeline stages,
which are explained in detail in Chapter 4.
Table 3.2: List of instructions and their operations
Sl no.  Instruction  Formula                                         Example
1       MPY          Treg * [dma] -> Preg                            MPY dma
                                                                     MPY {*|*+|*-}, next ARP
2       MPYK         Treg * constant -> Preg                         MPYK constant
3       MAC          Treg * [dma] -> Preg                            MAC dma
                     Acc + Preg -> Acc                               MAC {*|*+|*-}, next ARP
4       OR           (Acc | [dma]) & 0xffffffff -> Acc               OR dma
                                                                     OR {*|*+|*-}, next ARP
5       XOR          (Acc ^ [dma]) & 0xffffffff -> Acc               XOR dma
                                                                     XOR {*|*+|*-}, next ARP
6       SPAC         Acc - Preg -> Acc                               SPAC
7       SUB          Acc - [dma] * 2^shift -> Acc                    SUB dma, shift
                                                                     SUB {*|*+|*-}, shift, next ARP
8       SUBS         Acc - [dma] -> Acc                              SUBS dma
                                                                     SUBS {*|*+|*-}, next ARP
9       ADD          Acc + [dma] * 2^shift -> Acc                    ADD dma, shift
                                                                     ADD {*|*+|*-}, shift, next ARP
10      ADDS         Acc + [dma] -> Acc                              ADDS dma
                                                                     ADDS {*|*+|*-}, next ARP
11      AND          (Acc & [dma]) & 0x0000ffff -> Acc               AND dma
                                                                     AND {*|*+|*-}, next ARP
12      BU           [pma] -> PC                                     BU pma
13      BANZ         [Is AR(ARP) != 0]; Yes => [pma -> PC]           BANZ pma
                     No => [PC + 2 -> PC]                            BANZ pma, {*|*+|*-}, next ARP
14      BGEZ         [Is (ACC) >= 0]; Yes => [pma -> PC]             BGEZ pma
                     No => [PC + 2 -> PC]
15      BGZ          [Is (ACC) > 0]; Yes => [pma -> PC]              BGZ pma
                     No => [PC + 2 -> PC]
16      BLEZ         [Is (ACC) <= 0]; Yes => [pma -> PC]             BLEZ pma
                     No => [PC + 2 -> PC]
17      BLZ          [Is (ACC) < 0]; Yes => [pma -> PC]              BLZ pma
                     No => [PC + 2 -> PC]
18      BNZ          [Is (ACC) != 0]; Yes => [pma -> PC]             BNZ pma
                     No => [PC + 2 -> PC]
19      BV           [Is OV == 1]; Yes => [pma -> PC] && [OV -> 0]   BV pma
                     No => [PC + 2 -> PC]
20      BZ           [Is ACC == 0]; Yes => [pma -> PC]               BZ pma
                     No => [PC + 2 -> PC]
21      LAC          [dma] * 2^shift -> Acc                          LAC dma, shift
                                                                     LAC {*|*+|*-}, shift, next ARP
22      LACK         constant -> Acc                                 LACK 8-bit positive constant
23      LAR          [dma] -> AR                                     LAR AR, dma
                                                                     LAR AR, {*|*+|*-}, next ARP
24      LARK         constant -> AR                                  LARK AR, 8-bit positive constant
25      LARP         constant -> ARP                                 LARP 3-bit constant
26      LDP          [dma] & 0xff -> data page pointer               LDP dma
                                                                     LDP {*|*+|*-}, next ARP
27      LDPK         constant -> data page pointer                   LDPK 8-bit constant
28      LT           [dma] -> Treg                                   LT dma
                                                                     LT {*|*+|*-}, next ARP
29      LTA          [dma] -> Treg                                   LTA dma
                     Acc + Preg -> Acc                               LTA {*|*+|*-}, next ARP
30      LTP          [dma] -> Treg                                   LTP dma
                     Preg -> Acc                                     LTP {*|*+|*-}, next ARP
31      LTS          [dma] -> Treg                                   LTS dma
                     Acc - Preg -> Acc                               LTS {*|*+|*-}, next ARP
32      MAR          Modifies AR(ARP), and ARP as specified          MAR dma
                                                                     MAR {*|*+|*-}, next ARP
33      PAC          Preg -> Acc                                     PAC
34      ROVM         0 -> OVM status bit                             ROVM
35      SAC          (Acc) * 2^shift -> [dma]                        SAC dma, shift
                                                                     SAC {*|*+|*-}, shift, next ARP
36      SAR          AR -> [dma]                                     SAR AR, dma
                                                                     SAR AR, {*|*+|*-}, next ARP
37      SOVM         1 -> overflow mode (OVM status bit)             SOVM
38      NOP          N/A                                             N/A
39      ZAC          0 -> Acc                                        ZAC
40      ZALH         0 -> Acc[15:0]                                  ZALH dma
                     [dma] -> Acc[31:16]                             ZALH {*|*+|*-}, next ARP
41      ZALS         0 -> Acc[31:16]                                 ZALS dma
                     [dma] -> Acc[15:0]                              ZALS {*|*+|*-}, next ARP
42      APAC         Acc + Preg -> Acc                               APAC
43      CMPSIMD      Acc[7:0] v/s dma[7:0] -> Acc[7:0]               CMPSIMD dma
                     Acc[15:8] v/s dma[15:8] -> Acc[15:8]            CMPSIMD {*|*+|*-}, next ARP
                     Acc[23:16] v/s dma[23:16] -> Acc[23:16]
                     Acc[31:24] v/s dma[31:24] -> Acc[31:24]
44      SUBSIMD      Acc[7:0] - (dma[7:0]) -> Acc[7:0]               SUBSIMD dma
                     Acc[15:8] - (dma[15:8]) -> Acc[15:8]            SUBSIMD {*|*+|*-}, next ARP
                     Acc[23:16] - (dma[23:16]) -> Acc[23:16]
                     Acc[31:24] - (dma[31:24]) -> Acc[31:24]
45      ADDSIMD      Acc[7:0] + (dma[7:0]) -> Acc[7:0]               ADDSIMD dma
                     Acc[15:8] + (dma[15:8]) -> Acc[15:8]            ADDSIMD {*|*+|*-}, next ARP
                     Acc[23:16] + (dma[23:16]) -> Acc[23:16]
                     Acc[31:24] + (dma[31:24]) -> Acc[31:24]
46      PUSH         Acc -> Stack                                    PUSH
47      POP          Stack -> Acc                                    POP
48      CALL         PC -> Stack                                     CALL L2
                     [pma] -> PC
49      RET          PC -> Restore from Stack                        RET
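The lane-wise behaviour of the SIMD add and subtract instructions listed in Table 3.2 can be illustrated with the following sketch. The function names are hypothetical, and the wrap-around (modulo 256) behaviour of each lane is an assumption, since the table does not state how per-lane overflow is handled.

```python
from typing import Callable


def simd_lanes(acc: int, mem: int, op: Callable[[int, int], int]) -> int:
    """Apply op independently to each of the four 8-bit lanes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (acc >> shift) & 0xFF
        b = (mem >> shift) & 0xFF
        # Assumption: each lane wraps modulo 256, with no carry into its neighbour.
        result |= (op(a, b) & 0xFF) << shift
    return result


def addsimd(acc: int, mem: int) -> int:
    """Lane-wise Acc + [dma] -> Acc."""
    return simd_lanes(acc, mem, lambda a, b: a + b)


def subsimd(acc: int, mem: int) -> int:
    """Lane-wise Acc - [dma] -> Acc."""
    return simd_lanes(acc, mem, lambda a, b: a - b)


print(hex(addsimd(0x01FF10F0, 0x01010120)))  # 0x2001110
print(hex(subsimd(0x05050505, 0x01020306)))  # 0x40302ff
```

The key point is that a carry or borrow generated in one lane never propagates into the next, which is what distinguishes these instructions from the 32-bit ADDS and SUBS.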
Chapter 4
DSP Pipeline and Read/Write RAM buffer wrapper implementation
Speed of computation, measured as the number of instructions that can be executed per
second, has been the biggest challenge for digital processors since their invention. In
the processor world, years of experimentation and observation have established that
only two factors can significantly increase the speed of computing. The first is the
evolution of fabrication technology and the second is better computer architecture, a
pairing famously marketed as Intel’s tick-tock processor model [21]. Fabrication technology
is understandably of enormous complexity, since the topics involved include chemical
reactions, photonics, material science and device physics.
CPU architecture planning has a significant impact on speed. The key factor in achieving
computation speed is concurrency: performing as many operations as possible
simultaneously. Concurrency has two very important implementations, namely pipelining
and parallelism. Although they are rooted in the same origins and are often hard to
distinguish in practice, the two are discernibly different in their general approach [22].
Looking at a typical DSP MAC instruction, operands need to be fetched from memory and
multiplied, while the previous product is added to the accumulator and the address
register is post-incremented/decremented. Clearly, accomplishing all these sequential
functions would take multiple clock cycles if the DSP were not pipelined [23].
While pipelining effectively speeds up computation, programmability takes a serious hit
if it is not done properly. It may also result in losing some instruction cycles to data
dependency hazards. When a programmer writes an assembly program, it is assumed that
every instruction completes before the next instruction begins. This must be ensured by
carefully designing the pipeline, so that the DSP appears as if it were not pipelined
even though it is [23].
4.1 Pipeline implementation
A good pipeline design implements extensive pipelining with parallel architecture
capability, while ensuring that programmability is not impacted by dependency hazards.
This requires a system-level understanding of the DSP along with careful planning of each
pipeline stage [24].
Chapter 2 took a brief look at parallelism through the SIMD implementation in the ALU.
Here, a clear description of pipelining is presented. This chapter explains how each
task is split into pieces that are tackled simultaneously.
4.1.1 Pipeline stages
Before planning the pipeline, it is very important to understand the sequence of events
happening in the DSP and the data arrival times. Some very important points to remember
are:
1. It takes at least one clock cycle to fetch data from the ROM.
2. The DSP requires one clock cycle to decode the instruction coming from the ROM.
3. Data transfer to and from the RAM also takes at least one clock cycle.
4. Most instructions in the DSP use direct memory access (DMA), hence most of them
will have to read data from the RAM.
5. Most instructions are to be executed within a single clock cycle.
Considering all the points mentioned above, the DSP pipeline has been divided into four
stages. The name and function of each stage are described in Fig. 4.1.
Figure 4.1: Pipeline stages and implementation
1. The first stage is FETCH, where the instruction word is fetched from the ROM. The
program counter is updated by this stage, starting as soon as the DSP is switched on and
reset.
2. The second stage is DECODE, where the fetched instruction is decoded. This stage
is responsible for generating the RAM address and the corresponding handshake signals,
if necessary.
3. The third stage is WAIT, where the DSP waits for data to be fetched from the RAM
and feeds the read data into the execution unit. The AR and ARP are also updated at
this stage.
4. The fourth stage is EXECUTE, where all arithmetic operations are performed by the
DSP.
Fig. 4.1 illustrates how the pipeline works in the DSP for the first four instructions.
Assuming that the DSP is reset at T0, it is observed that the DSP does not execute the
first instruction until T4. After T4, however, there is an output every cycle. Hence,
technically, all single-cycle instructions have a latency of four cycles, though they take
only a single cycle to execute.
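The latency and throughput figures above can be reproduced with a small scheduling sketch for an ideal, stall-free four-stage pipeline. The function names are hypothetical, and the convention that cycle k spans the interval from T(k-1) to T(k) is an assumption made for this illustration.

```python
STAGES = ("FETCH", "DECODE", "WAIT", "EXECUTE")


def pipeline_schedule(n_instructions: int) -> dict[int, dict[str, int]]:
    """Map each cycle to the instruction index occupying each stage."""
    schedule: dict[int, dict[str, int]] = {}
    for instr in range(n_instructions):
        for depth, stage in enumerate(STAGES):
            cycle = instr + depth + 1  # cycle k spans T(k-1)..T(k)
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def completion_cycle(instr: int) -> int:
    """Cycle at whose end the result of instruction instr (0-based) is ready."""
    return instr + len(STAGES)


schedule = pipeline_schedule(4)
print(schedule[4])  # {'EXECUTE': 0, 'WAIT': 1, 'DECODE': 2, 'FETCH': 3}
print([completion_cycle(i) for i in range(4)])  # [4, 5, 6, 7]
```

The first result appears after four cycles, and one further result appears every cycle after that, which matches the latency-versus-throughput distinction made above.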
4.1.2 Pipeline design for non-branching instructions
The pipelining of non-branching instructions is straightforward for all read instructions.
Write instructions, however, need to be handled slightly differently because of the way
the RAM works and the DMA design of the DSP.
Fig. 4.2 illustrates the pipeline operation for DMA read instructions, where all four
instructions are assumed to be DMA read instructions. The steps followed at each pipeline
stage of a DMA read instruction are listed below:
1. Fetch stage: The fetch stage reads the instruction from the ROM and stores it in an
internal register for the next stage. It also increments the value of the PC by one, so
that the next instruction is fetched in the following cycle.
2. Decode stage: The decode stage determines whether the instruction uses direct or
indirect addressing. It is also responsible for setting up the appropriate handshake
signals for the memory read and for generating the read address after decoding the
instruction.
3. Wait stage: In the case of direct addressing, the wait stage does nothing. In the case
of indirect addressing, however, this stage updates the AR and ARP registers.
4. Execute stage: By this stage, the memory read operation has finished. Hence, the
fetched data is now used by the execution unit for computation. By the end of this cycle,
the output is stored in either the product register or the accumulator, depending on the
instruction executed.
Figure 4.2: Pipeline example for memory read instructions
Fig. 4.3 illustrates the pipeline operation for four instructions, where the second and
third are DMA write instructions, while the first and fourth are DMA read instructions.
The steps followed at each pipeline stage of a DMA write instruction are listed below:
1. Fetch stage: This stage is exactly the same as the DMA read fetch stage. The
instruction is read from the ROM and passed on to the decode stage, while the value of
the PC is incremented by one to fetch the next instruction.
2. Decode stage: The instruction is decoded, and the write address is set up according to
direct/indirect addressing. Appropriate handshake signals for the memory write are
generated.
3. Wait stage: By the time this stage is completed, the appropriate write data needs to
be ready. To accomplish this for the SAC (store accumulator) instruction, the
architecture has been modified at this stage to forward the result produced in the fourth
cycle of the previous instruction when that instruction is an ALU operation. Updating the
AR and ARP for indirect addressing is also the responsibility of this stage.
4. Execute stage: During this stage, the memory write operation is performed by the
read/write RAM buffer wrapper.
4.1.3 Pipeline design for unconditional branching instructions
As observed from Table 3.1, branching instructions are two-cycle instructions, and all
except return (RET) require two instruction words. From Chapter 3, it is also established
that the first instruction word contains the instruction opcode, while the second contains
the jump or branch address.
Figure 4.3: Pipeline example for memory write instructions
During the execution of a two-word branch instruction, all stages of the pipeline are
stalled while the second word is read, since it is not an instruction. Appropriate changes
are made at every pipeline stage to make sure that the second word is not read as an
instruction, but stored as the jump address.
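The treatment of the second word can be illustrated with a purely functional model of instruction fetch. It is a sketch that ignores pipeline timing; the function name is hypothetical, and only the BU and NOP encodings, taken from Table 3.1, are modelled.

```python
BU = 0x010F   # 0000 0001 0000 1111 in Table 3.1
NOP = 0x014F  # 0000 0001 0100 1111 in Table 3.1


def trace_decoded_addresses(rom: list[int], max_instructions: int) -> list[int]:
    """Return the ROM addresses whose contents are decoded as instructions."""
    pc = 0
    decoded = []
    while pc < len(rom) and len(decoded) < max_instructions:
        word = rom[pc]
        decoded.append(pc)
        pc += 1
        if word == BU:
            if pc >= len(rom):
                raise ValueError("branch is missing its address word")
            # The second word is data, not an instruction: load it into the PC.
            pc = rom[pc]
    return decoded


rom = [NOP, BU, 5, NOP, NOP, NOP, NOP]
print(trace_decoded_addresses(rom, 10))  # [0, 1, 5, 6]
```

Address 2, which holds the jump address, never appears in the list of decoded addresses.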
There are two types of branching instructions, namely conditional and unconditional
branching instructions. The unconditional branching instructions are BU (branch
unconditional), CALL (call) and RET (return), where the branch must always be taken.
Conditional branching instructions are those where the branching decision is made based
on the evaluation of a condition.
Figure 4.4 shows a pipeline implementation example of an unconditional branch
instruction. The steps followed at each pipeline stage of an unconditional branch
instruction are listed below:
1. Fetch stage: During this stage, the DSP reads the instruction from the ROM and
increments the PC by one, just as for all other instructions. However, the fetch is then
stalled in order to read the jump or branch address.
2. Decode stage: The instruction is decoded, and the call registers are modified
accordingly for call and return instructions, while the jump address is read from the
program memory and fed to the PC.
3. Wait stage: As the unconditional branch has already been executed at this point, this
stage does almost nothing. However, for call and return instructions, the stack pointer is
stored and restored respectively at this stage.
4. Execute stage: This stage performs no task, since the instruction has already
accomplished its purpose.
Figure 4.4: Pipeline example for unconditional branching
4.1.4 Pipeline design for conditional branching instructions
Before looking at the implementation of conditional branching, it is necessary to
understand how the instruction works in practice. Almost all conditional branching
instructions rely either on status flags resulting from an ALU operation or on the ALU
result itself. In terms of timing and data dependency, the worst-case scenario, in which the
previous instruction is an ALU operation, is therefore assumed when designing the pipeline
stages.
With the assumption that the previous instruction is an ALU operation, the outcome of the
branching condition is not known until that operation is complete. From the pipeline
design for ALU operations, or DMA read operations, discussed in Section 4.1.2, it is clear
that execution happens only at the last pipeline stage. Hence, the branching decision
cannot be made until after the fourth cycle of the previous instruction has been executed.
However, after stalling the pipeline to read the branch address, and before the third stage
of the pipeline of the conditional branch instruction, the DSP should already know where
to fetch the next instruction from. In other words, during the second pipeline stage, the
PC needs to be updated to the next program address to fetch from. This results in a
dilemma: the branching decision needs to be made at the second stage of the pipeline, but
the decision is not available until the fourth stage.
There are two solutions to this problem:
Solution 1: Decide to not take the branch, and make necessary changes if the condition
turns out to be true.
Solution 2: Predict the branch, and pay the penalty of two cycles if wrong by making
necessary changes in the case of a wrong prediction.
Though solution 2 is the better option and has numerous methods of execution, even the
simplest branch predictor requires a lot of additional hardware and planning. To keep the
DSP design simple, solution 1 is adopted in this design.
Figures 4.5 and 4.6 illustrate both cases of the pipeline operation for conditional branch
instructions: the first case, where the condition evaluates to false, and the second case,
where the condition evaluates to true. In both cases, the first instruction is assumed to
be the evaluation instruction, hence the branch/jump condition is evaluated based on its
outcome. The second instruction is the conditional branch instruction, while the last two
instructions are unconditional instructions. The pipeline plan of action for each stage is
listed below:
1. Fetch stage: During this stage, the DSP reads the instruction from the ROM and
increments the PC by one, just as for all other instructions. Fetching of the next
instruction is stalled to read the jump or branch address.
2. Decode stage: The instruction is decoded, and the jump address is read from the program
memory and stored until the execute stage. However, the value of the PC is incremented by
one so that the pipeline works smoothly and no cycles are wasted in case the branch
condition evaluates to false.
3. Wait stage: This stage performs no task.
4. Execute stage: The branching condition is evaluated at this stage. If the condition
evaluates to false, no change is made to the pipeline flow. However, if the condition
evaluates to true, the value of the PC is updated to the jump address, and the
computations of the previous two cycles are wasted. If the jump is taken, it is very
important to undo any modifications made in the previous two cycles and to stall the
pipeline accordingly, so that no unwanted data is propagated.
Figure 4.5: Pipeline implementation example for conditional branch instruction, when
condition is false
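As a rough worked example of what the adopted scheme costs, the sketch below counts the cycles wasted by taken branches, using the two-cycle figure stated above. The function name and the example loop are hypothetical; stalls for reading the branch address are deliberately not modelled.

```python
FLUSH_PENALTY = 2  # cycles of work discarded per taken branch (from the text)


def wasted_cycles(branch_outcomes: list[bool], penalty: int = FLUSH_PENALTY) -> int:
    """Total cycles discarded when every branch is assumed not taken.

    branch_outcomes holds True for each branch that is actually taken.
    """
    return penalty * sum(1 for taken in branch_outcomes if taken)


# Hypothetical loop closed by a conditional branch: taken 9 times, then falls through.
loop_outcomes = [True] * 9 + [False]
print(wasted_cycles(loop_outcomes))  # 18
```

A branch that falls through costs nothing under this scheme, so the penalty grows with the number of loop iterations rather than with the number of branch instructions in the program.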
Figure 4.6: Pipeline implementation example for conditional branch instruction, when
condition is true
4.2 Read/write RAM buffer wrapper
DSPs have much higher memory bandwidth and use many more memory-to-memory instructions
than traditional processors [25]. While most DSPs tackle this problem using small, fast
and simple parallel memory banks, it is very difficult to design compilers for such DSPs,
and their power consumption increases significantly [18]. Since data memory access is so
important in DSPs, it is crucial to ensure that memory access is quick and effective,
while keeping the power consumption low. Hence, both the data and address memories are
clocked at the same speed as the DSP, in an effort to keep the total power consumption low.
Taking a brief look at the pipeline implementation described in Section 4.1.2, it can be
observed that a serious problem exists in the case of RAM writes, since the write data is
provided a cycle after the write address is generated. Even though the ISA is almost the
same, the pipelining and data memory addressing of the DSP implemented in this paper are
very different from those of the TMS32010, and this problem is not observed in the
TMS32010. This is mainly because the TMS32010 had its memory clocked at a minimum of twice
the speed of the DSP itself. This is evident from some of the instructions in its ISA that
have not been implemented in this DSP. A good example is the LTD instruction, which
featured multiple memory transactions within a single clock cycle [20].
Section 4.2.1 describes what problems were faced due to clocking the memory at the
same speed as the processor, and Section 4.2.2 describes how the problem has been resolved
using the read/write RAM buffer wrapper.
4.2.1 RAM read/write problem description
Looking at the pipeline implementation of data memory (DMA) write operations in Section
4.1.2, it is observed that the write address and handshaking signals are generated at
stage 2, the decode stage, while the write data is sent to the data memory in the next
stage, which is stage 3, the wait stage. However, the RAM requires the address, the
handshaking signals and the data to be written all within the same cycle. This is not
possible with the pipeline design implemented in this paper, since the write data may be
computed in the last stage of the pipeline of the previous instruction.
Hence, since it is not possible to make sure that the RAM receives the write data in the
correct cycle, a buffer layer has been designed to facilitate the data flow. The buffer
layer is simple in design and implementation and uses the minimal hardware required to
serve the purpose, since the other easy solutions involve using a faster clock for the
memory, leading to an increase in power consumption.
Figure 4.7: Read/write RAM buffer wrapper state machine
4.2.2 Design and implementation of read/write buffer wrapper
The wrapper is designed with a simple goal: delay the data memory write operation by a
single cycle, while seamlessly providing the correct data whenever necessary. Translating
this to a plan of action, the following procedures were followed:
1. For every write operation, store the address and corresponding data.
2. For every read, check the address. If it matches the buffer address, transfer the
contents of the buffer data as the output to the DSP. Else, make the necessary
arrangements to fetch the data directly from the RAM, and send it to the DSP.
3. For every other write operation following the first, store the buffer data onto RAM
and update the buffer address and data with the corresponding new values.
Fig. 4.7 illustrates the state-machine for the read/write buffer wrapper. The state machine
consists of a total of four states depending on the type of operation involved. Since write
is implemented as a two-stage operation in the pipeline, the following instruction also
needs to be accounted for within the read/write buffer wrapper. A brief explanation of the
implementation is described below, detailing the operation of each state:
1. Read state: Read state is also the idle state. If the read address matches the buffer
address register contents, the buffer data is transferred to the DSP. However, if the
read address is different from the buffer address register contents, the required data
is fetched from RAM and transferred to the DSP within the next clock cycle.
2. Write state: A single-bit flag is used to keep track of whether the buffer data has been
transferred to the RAM or not. Every time new write data arrives, if data is present
in the buffer data register, it is transferred to the RAM address corresponding to the
buffer address, which is retrieved from the buffer address register. This is followed
by storing the new write address in the buffer address register.
3. RAW state: In the RAW state or read-after-write state, the write data is stored in
the buffer data register. Also, all functions in the read state are performed here as
well.
4. WAW state: In the WAW state or the write-after-write state, the write data is directly
sent to the RAM, at the address location corresponding to the buffer address register.
The new write address is now stored in the buffer address register.
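The three procedures behind the wrapper can be captured in a short behavioural model. The sketch below is not cycle-accurate and does not reproduce the four-state machine of Fig. 4.7; the class and attribute names are hypothetical, and an empty buffer is modelled with None instead of the single-bit flag.

```python
class BufferedRam:
    """RAM with a one-entry write buffer that delays each write."""

    def __init__(self, size: int) -> None:
        self.ram = [0] * size
        self.buf_addr: int | None = None  # None models "buffer empty"
        self.buf_data = 0

    def write(self, addr: int, data: int) -> None:
        """Commit the previously buffered write, then buffer the new one."""
        if self.buf_addr is not None:
            self.ram[self.buf_addr] = self.buf_data
        self.buf_addr, self.buf_data = addr, data

    def read(self, addr: int) -> int:
        """Forward the buffered data on an address match, else read the RAM."""
        if addr == self.buf_addr:
            return self.buf_data
        return self.ram[addr]


mem = BufferedRam(16)
mem.write(3, 42)
print(mem.ram[3], mem.read(3))  # 0 42: the RAM is not yet updated, the read is forwarded
mem.write(5, 7)
print(mem.ram[3], mem.read(5))  # 42 7: the first write is committed by the second
```

From the point of view of the DSP, every read returns the most recently written value, even though the RAM itself lags one write behind.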
Chapter 5
Median filter design
Image processing and filtering is an area where DSPs have been used extensively since
their invention. In recent years, however, more complex image processing has been
handled by graphics processing units (GPUs), mainly due to their hardware parallelism
and the enormous amount of data that needs to be processed. However, numerous image
filtering applications still use DSPs, but in multiprocessor configurations.
Taking a brief look at image data, it is usually represented by the amount of Red, Green
and Blue (RGB) colors over a fixed area of a preset number of very small points called
pixels. The common representation of a standard dimension image is 24-bit RGB values
per pixel, over an area of (720 x 576) pixels. Most simple DSPs are 16-bit fixed-point
architectures, hence to handle image data they would require two data words per pixel.
Expanding the data word to at least 24-bits hence could result in further applications in
image handling and processing.
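The storage argument above can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical, and the placement of red in bits [23:16], green in bits [15:8] and blue in bits [7:0] is an assumption made only for illustration.

```python
def pack_rgb(red: int, green: int, blue: int) -> int:
    """Pack one 24-bit RGB pixel into the low 24 bits of a data word."""
    if any(not 0 <= c <= 0xFF for c in (red, green, blue)):
        raise ValueError("each colour component must fit in 8 bits")
    return (red << 16) | (green << 8) | blue


def words_per_pixel(bits_per_pixel: int, word_bits: int) -> int:
    """Number of data words needed to hold one pixel (ceiling division)."""
    return -(-bits_per_pixel // word_bits)


pixels = 720 * 576
print(hex(pack_rgb(0x12, 0x34, 0x56)))   # 0x123456
print(pixels * words_per_pixel(24, 16))  # 829440 words on a 16-bit DSP
print(pixels * words_per_pixel(24, 32))  # 414720 words with a 32-bit data word
```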
The following sections of this chapter present a simple application of the designed DSP,
to showcase the merits of its enhancements over the TMS32010 by implementing a median
filter. Section 5.1 presents an overview of the median filter by explaining how a median filter
works. Section 5.2 discusses the median filter algorithm design and its implementation.
5.1 Median filter overview
Median filters are non-linear digital filters widely used to remove salt-and-pepper noise.
The implementation of the median filter is quite simple and straightforward. Considering a
3x3 window of pixels of an image, the following steps are followed to find the median:
Step 1: Arrange the pixels one after the other.
Step 2: Rearrange the pixels in an ascending or descending order.
Step 3: Pick the central value of the arranged pixels, which is the fifth pixel in this
case. This will be the median.
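The three steps can be sketched in a few lines of Python. This is a software illustration only, not the DSP assembly implementation; the function name median_of_window and the sample pixel values are assumptions made for the example.

```python
def median_of_window(window):
    """Return the median of a 3x3 window given as three rows of three pixel values."""
    pixels = [p for row in window for p in row]  # Step 1: arrange the pixels one after the other
    if len(pixels) != 9:
        raise ValueError("expected a 3x3 window")
    ordered = sorted(pixels)                     # Step 2: rearrange in ascending order
    return ordered[4]                            # Step 3: pick the fifth (central) value


# A mostly uniform window with one "salt" pixel (255) in the centre
window = [
    [10, 12, 11],
    [13, 255, 12],
    [11, 10, 14],
]
result = median_of_window(window)
```

The outlier value 255 ends up at the end of the sorted list and therefore never reaches the central position, which is why the median suppresses isolated noise pixels.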
While the median filter implementation looks like a simple process, it takes a
significant amount of effort to arrange the pixels in ascending or descending order, as every
pixel needs to be compared to every other pixel, and this needs to be done sequentially to
keep track of the order of their arrangement.
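To make the cost of comparing every pixel with every other pixel concrete, the following Python sketch finds the median by ranking and counts the pairwise comparisons. It is one possible formulation, not necessarily the one used in the DSP assembly code; the function name median_by_ranking and the tie-breaking rule are assumptions.

```python
def median_by_ranking(pixels):
    """Find the median by comparing each pixel with every other pixel exactly once."""
    n = len(pixels)
    rank = [0] * n          # rank[i] = number of pixels that sort before pixel i
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):
            comparisons += 1
            # Ties are broken by position so that every rank is unique
            if pixels[i] > pixels[j]:
                rank[i] += 1
            else:
                rank[j] += 1
    median = pixels[rank.index(n // 2)]  # the pixel with exactly n // 2 pixels before it
    return median, comparisons


median, comparisons = median_by_ranking([10, 12, 11, 13, 255, 12, 11, 10, 14])
```

For a 3x3 window this takes 9 x 8 / 2 = 36 pairwise comparisons when they are performed one at a time.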
Fig. 5.1 illustrates the working of a median filter. In the figure, P1, P2, P3, P4, P5, P6,
P7, P8 and P9 are pixel values of the 3x3 window from the image. After step 2, note that
the new pixel values P1', P2', P3', P4', P5', P6', P7', P8' and P9' indicated in the figure
represent the rearranged pixel values.
5.2 Median filter design and implementation
The median filter algorithm design for the 3 × 3 pixel window is shown in Figure 5.2.
The figure is self-explanatory and describes the algorithm, which has been
implemented in DSP assembly language.
Figure 5.1: Median filter working illustration
Figure 5.2: Median filter algorithm
The algorithm is applied to an image by moving the 3 × 3 window
from the top-left corner of the image across all columns; as soon as the medians for the
first row have been computed, the window is moved to the next row. This is repeated
until the last row is computed. Fig. 5.3 illustrates the implementation of the algorithm on an
image. The first part of the figure shows the 3 × 3 window placement while computing
the first median, the second part shows the window placement while computing the second
median and the third part shows the window placement while computing the median for
the second row. This window placement pattern is repeated until all the medians are
computed. It is worth noting that, with this implementation, the output image loses two rows
and two columns, because the window is only placed where it fits entirely inside the image.
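The window movement can be modelled in software as follows. This Python sketch only mirrors the traversal order described in this section; it is not the DSP assembly code, and the function name median_filter_3x3 and the test image are assumptions.

```python
def median_filter_3x3(image):
    """Slide a 3x3 window over a 2-D list of pixel values and return the medians.

    The window starts at the top-left corner, moves across all columns, then
    drops to the next row. The output has two fewer rows and two fewer columns
    than the input because the window is never placed partly outside the image.
    """
    rows, cols = len(image), len(image[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3")
    output = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            window = [image[r + dr][c + dc] for dr in range(3) for dc in range(3)]
            out_row.append(sorted(window)[4])  # central value of the nine sorted pixels
        output.append(out_row)
    return output


# 4x5 test image: uniform background with one "salt" (255) and one "pepper" (0) pixel
noisy = [
    [10, 10, 10, 10, 10],
    [10, 255, 10, 10, 10],
    [10, 10, 10, 0, 10],
    [10, 10, 10, 10, 10],
]
filtered = median_filter_3x3(noisy)
```

For the 4 x 5 input the result is a 2 x 3 image in which both noise pixels have been replaced by the background value.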
Figure 5.3: Median filter algorithm implementation illustration for a 3× 3 window
Chapter 6
Results
This chapter discusses the results from this project, as well as future work that could be
completed.
6.1 Results
The DSP design was synthesized using Synopsys Design Compiler with a 180 nm technology
node from TSMC. The Cadence Virtuoso suite was used for design, debugging and simulation
of the design. Table 6.1 gives the synthesis results for the post-scan netlist of the design,
when synthesized at 50 MHz.
Due to time constraints, it was not possible to fully verify the DSP design. The DSP
has, however, been verified to work at gate level, where most of its instructions and numerous
branching dependencies have been tested. The basic median filter algorithm was written
in assembly language and verified to work on the DSP. Details, including the assembly
code that was used for testing the DSP, are included in Appendix A.
Table 6.1: Synthesis results
Category        Parameter               Value
Area (µm²)      Noncombinational area   181298
                Combinational area      167021
                Buf/Inv area            9027
                Total cell area         348320
Power (mW)      Internal power          9.4111
                Switching power         1.6481
                Leakage power           1.4210
                Total                   11.0607
Timing (ns)     Data arrival time       18.1474
                Slack                   1.4799
DFT coverage    Test coverage           99.92%
Chapter 7
Conclusions and future work
7.1 Conclusion
The DSP design has been successfully implemented, verified for the instructions mentioned
in Appendix A and synthesized at 50 MHz. The design was kept simple, since its ISA and
architecture are based on the TMS32010. Power efficiency was achieved by running
the memory at the same speed as the DSP. A median filter algorithm was written in
assembly, simulated at gate level and verified to work on the DSP within 100 instructions.
This demonstrates that the enhanced SIMD instructions can be used for median filter
computation and indicates that the DSP is capable of handling simple multimedia applications.
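As an illustration of why lane-wise SIMD comparison suits median filtering, the following Python sketch models a packed subtract that produces one set of carry, negative, overflow and zero flags per lane, in the spirit of the four CNVZ groups of the status register in the RTL appendix. It is a behavioural sketch under stated assumptions, not a description of the implemented instructions: the lane width of 8 bits, the use of the carry flag to signal a borrow, and the function names lanes, pack, simd_sub_flags and simd_min_max are all hypothetical.

```python
LANES = 4        # assumed: four lanes in a 32-bit word
LANE_BITS = 8    # assumed: one 8-bit pixel channel per lane
LANE_MASK = 0xFF


def lanes(word):
    """Split a 32-bit word into four 8-bit lane values, lane 0 first."""
    return [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def pack(values):
    """Pack four 8-bit lane values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & LANE_MASK) << (LANE_BITS * i)
    return word


def simd_sub_flags(a_word, b_word):
    """Lane-wise a - b, returning the packed result and per-lane C, N, V, Z flags."""
    results, flags = [], []
    for a, b in zip(lanes(a_word), lanes(b_word)):
        diff = (a - b) & LANE_MASK
        flags.append({
            "C": a < b,                              # assumed convention: set on borrow
            "N": bool(diff & 0x80),                  # sign bit of the lane result
            "V": bool((a ^ b) & (a ^ diff) & 0x80),  # signed overflow of the subtraction
            "Z": diff == 0,
        })
        results.append(diff)
    return pack(results), flags


def simd_min_max(a_word, b_word):
    """Compare-exchange all four lanes at once using only the borrow flag."""
    _, flags = simd_sub_flags(a_word, b_word)
    a, b = lanes(a_word), lanes(b_word)
    lo = [x if f["C"] else y for x, y, f in zip(a, b, flags)]
    hi = [y if f["C"] else x for x, y, f in zip(a, b, flags)]
    return pack(lo), pack(hi)


a = pack([10, 200, 7, 90])
b = pack([12, 150, 7, 255])
lo, hi = simd_min_max(a, b)
```

A compare-exchange of this kind is the basic step of sorting the nine window pixels, and performing it on four lanes at once reduces the number of sequential comparison instructions.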
7.2 Future work
Since the DSP was designed in a very short span of time, it could not be tested
thoroughly. The DSP must be tested completely before it is
used in an application, so this would be the first thing to work on. The median
filter algorithm described in Chapter 5, though successfully implemented, could not be
tested with a noisy image due to time constraints. Doing this would demonstrate the
capabilities of the DSP, and a comparison of the results with a similar implementation on
the TMS32010 would support the claims presented in this paper.
Designing a compiler would be a necessary next step. Another
interesting enhancement would be the design of a parallel-processor system using
multiple DSPs for advanced imaging applications.
References
[1] B. Marr, “Big data: 20 mind-boggling facts everyone must read.”
[2] T. Jamil, “RISC versus CISC,” IEEE Potentials, vol. 14, no. 3, pp. 13–16, 1995.
[3] W. P. Hays, “Dsps: Back to the future,” Queue, vol. 2, no. 1, p. 42, 2004.
[4] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar,
M. Bhat, D. Bindel, S. Boldo et al., “IEEE standard for floating-point arithmetic,”
IEEE Std 754-2008, pp. 1–70, 2008.
[5] S. W. Smith, The scientist and engineer’s guide to digital signal processing. California
Technical Pub., 1999.
[6] C. Inacio and D. Ombres, “The DSP decision: Fixed point or floating?” IEEE Spectrum,
vol. 33, no. 9, pp. 72–74, 1996.
[7] S. Smith, Digital signal processing: a practical guide for engineers and scientists,
S. Smith, Ed. Newnes, 2013.
[8] G. Frantz, “Signal core: A short history of the digital signal processor,” IEEE Solid-
State Circuits Magazine, vol. 4, no. 2, pp. 16–20, 2012.
[9] E. J. Tan and W. B. Heinzelman, “Dsp architectures: past, present and futures,”
ACM SIGARCH Computer Architecture News, vol. 31, no. 3, pp. 6–19, 2003.
[10] A. Abnous and N. Bagherzadeh, “Pipelining and bypassing in a vliw processor,” IEEE
Transactions on Parallel and Distributed Systems, vol. 5, no. 6, pp. 658–664, 1994.
[11] J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D. Meltzer, U. Shvadron,
and M. Ware, “Trends in compilable dsp architecture,” in Signal Processing Systems,
2000. SiPS 2000. 2000 IEEE Workshop on. IEEE, 2000, pp. 181–199.
[12] C. Choo, J. Chung, J. Fong, and S. E. Cheung, “Implementation of texas instruments
tms32010 dsp processor on altera fpga,” in Global Signal Processing Expo & Conf. San
Jose State University, 2004.
[13] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach.
Elsevier, 2011.
[14] S. L. Harris and D. M. Harris, Digital Design and Computer Architecture: ARM
Edition. Morgan Kaufmann, 2016.
[15] A. David and H. John, “Computer organization and design: the hardware/software
interface,” San Mateo, CA: Morgan Kaufmann Publishers, vol. 1, p. 998, 2005.
[16] K. Ngan, A. Kassim, and H. Singh, “Parallel image-processing system based on the
tms32010 digital signal processor,” IEE Proceedings E (Computers and Digital Techniques),
vol. 134, no. 2, pp. 119–124, 1987.
[17] D. Holburn and I. Sommerville, “A high-speed image processing system using the
tms32010,” Software & Microsystems, vol. 4, no. 5, pp. 102–108, 1985.
[18] E. A. Lee, “Programmable dsp architectures. i,” IEEE ASSP Magazine, vol. 5, no. 4,
pp. 4–19, 1988.
[19] G. Araujo, A. Sudarsanam, and S. Malik, “Instruction set design and optimizations
for address computation in dsp architectures,” in Proceedings of the 9th international
symposium on System synthesis. IEEE Computer Society, 1996, p. 105.
[20] T. Instruments and P. Strzelecki, TMS32010 User’s Guide. Texas Instruments, 1983.
[21] T. Jain and T. Agrawal, “The haswell microarchitecture-4th generation processor,”
International Journal of Computer Science and Information Technologies, vol. 4, no. 3,
pp. 477–480, 2013.
[22] P. M. Kogge, The architecture of pipelined computers. CRC Press, 1981.
[23] E. A. Lee, “Programmable dsp architectures. ii,” IEEE ASSP Magazine, vol. 6, no. 1,
pp. 4–14, 1989.
[24] E. Lee and D. Messerschmitt, “Pipeline interleaved programmable dsp’s: Architecture,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 9,
pp. 1320–1333, 1987.
[25] N. H. Weste and D. Harris, CMOS VLSI design: a circuits and systems perspective.
Pearson Education India, 2015.
Appendix I
Source Code
I.1 RTL source code
I.1.1 DSP top level module
////////////////////////////////////////////////////////////////////////////
// Author      : Shashank Simha
// Date        : 12/12/2017
// University  : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
////////////////////////////////////////////////////////////////////////////
module DSP_Version1 (
    reset,





    SW_pin, Display_pin,
    DM_out, CEN, wr_data, DM_Addr, DM_in, OEN, // ram_buffer




input  [4:0] SW_pin;      // Four switches and one push-button
output [7:0] Display_pin; // 8 LEDs

input
    reset,      // system reset
    clk;        // system clock

input
    scan_in0,   // test scan mode data input
    scan_en,    // test scan mode enable
    test_mode;  // test mode select

output
    scan_out0;  // test scan mode data output

/////////////RAM ports///////////////////
output [14:0] DM_Addr;
output [31:0] DM_in;
input  [31:0] DM_out;
output reg wr_data, OEN, CEN;
/////////////ROM ports///////////////////
input  [15:0] PM_out;





//-- 1 ISA Parameters
//-----------------------------------------------------------
parameter [7:0]  MPY     = 8'b00000000; // 1.
parameter [7:0]  MPYK    = 8'b10100000; // 2.
parameter [7:0]  MAC     = 8'h02;       // 3.
parameter [7:0]  OR      = 8'h03;       // 4.
parameter [7:0]  XOR     = 8'h04;       // 5.
parameter [15:0] SPAC    = 16'h0100;    // 6.
parameter [2:0]  SUB     = 3'h1;        // 7.
parameter [7:0]  SUBS    = 8'h05;       // 8.
parameter [2:0]  ADD     = 3'h2;        // 9.
parameter [7:0]  ADDS    = 8'h06;       // 10.
parameter [7:0]  AND     = 8'h07;       // 11.
parameter [15:0] BU      = 16'h010F;    // 12.
parameter [15:0] BGEZ    = 16'h0101;    // 14.
parameter [15:0] BGZ     = 16'h0102;    // 15.
parameter [15:0] BLEZ    = 16'h0103;    // 16.
parameter [15:0] BLZ     = 16'h0104;    // 17.
parameter [15:0] BNZ     = 16'h0105;    // 18.
parameter [15:0] BV      = 16'h0106;    // 19.
parameter [15:0] BZ      = 16'h0107;    // 20.
parameter [2:0]  LAC     = 3'h3;        // 21.
parameter [7:0]  LACK    = 8'b10100001; // 22.
parameter [7:0]  LAR     = 5'b11000;    // 23.
parameter [7:0]  LARK    = 5'b11001;    // 24.
parameter [7:0]  LARKH   = 5'b11011;    // 25.
parameter [7:0]  LARP    = 8'b10100011; // 26.
parameter [7:0]  LDP     = 8'h09;       // 27.
parameter [7:0]  LDPK    = 8'b10100100; // 28.
parameter [7:0]  LT      = 8'h0a;       // 29.
parameter [7:0]  LTA     = 8'h0b;       // 30.
parameter [7:0]  LTD     = 8'h0c;       // 31.
parameter [7:0]  LTP     = 8'h0d;       // 32.
parameter [7:0]  LTS     = 8'h0e;       // 33.
parameter [7:0]  MAR     = 8'h0f;       // 34.
parameter [15:0] PAC     = 16'h011F;    // 35.
parameter [15:0] ROVM    = 16'h012F;    // 36.
parameter [2:0]  SAC     = 3'h4;        // 37.
parameter [7:0]  SAR     = 5'b11010;    // 38.
parameter [15:0] SOVM    = 16'h013F;    // 39.
parameter [7:0]  TBLR    = 8'h11;       // 40.
parameter [7:0]  TBLW    = 8'h12;       // 41.
parameter [15:0] NOP     = 16'h014F;    // 42.
parameter [15:0] ZAC     = 16'h015F;    // 43.
parameter [7:0]  ZALH    = 8'h13;       // 44.
parameter [7:0]  ZALS    = 8'h14;       // 45.
parameter [15:0] APAC    = 16'h016F;    // 46.
parameter [6:0]  CMPSIMD = 7'b0001101;  // 47.
parameter [7:0]  SUBSIMD = 8'h16;       // 48.
parameter [7:0]  ADDSIMD = 8'h17;       // 49.
parameter [8:0]  BANZ    = 8'h18;       // 13.
parameter [15:0] PUSH    = 16'h018F;    // 50.
parameter [15:0] POP     = 16'h017F;    // 51.
parameter [15:0] CALL    = 16'h01AF;    // 52.
parameter [15:0] RET     = 16'h019F;    // 53.

//-----------------------------------------------------------
//-- 2 Internal registers (& wires)
//-----------------------------------------------------------
reg [15:0] AR [7:0];   // 8 Auxiliary Registers of width 16 bits each
reg [2:0]  ARP, PARP;  // 2 Registers to store current and previous AR pointers

reg [7:0]  DPPTR;

reg [31:0] acc, Preg;
reg [15:0] Treg;

wire [15:0] SR_wire;
reg  [15:0] SR;        // Status register (4*CNVZ) for SIMD instructions

reg [4:0] SP;          // Stack pointer
reg [4:0] CSP;         // Call stack pointer
//-----------------------------------------------------------
//-- 3 Pipeline registers
//-----------------------------------------------------------
reg [4:0]  sreg2, sreg3, sreg4;            // Used to temporarily store Shift value between pipeline stages
reg [15:0] JAddr2, JAddr3, JAddr4, JAddr;  // Used to temporarily store Jump Address between pipeline stages
reg [31:0] temp_acc;
reg [31:0] breg;
reg [15:0] PAR [7:0];  // 8 registers of width 16 bits each, holding previous AR values
reg cnt; //
reg JFlag, JFlag_del, JFlag_c, JFlag_uc;
reg stall_mc1, stall_mc2, stall_mc3, stall_mc4, stall_uc; //
reg [15:0] IR2, IR3, IR4, IR_del;
reg J_detect;
reg [15:0] DM_Addr_reg;
//-----------------------------------------------------------
//-- 4 Memory clock setup
//-----------------------------------------------------------
// assign clk_n <=~clk;
reg get_DMAddr;


reg [31:0] stack [31:0];       // 32 stack registers
reg [15:0] call_stack [31:0];  // 32 call stack registers
reg [15:0] call_SR [31:0];     // 32 call SR registers






reg DM_cnt;

wire [31:0] s1_in;
wire [31:0] s1_out;

wire [31:0] alu_out;

wire [15:0] mult_2;
wire [31:0] result;

wire [31:0] Preg_wire;
reg  [31:0] s1_in_reg, buff;
reg  [15:0] buff_mult_2;
reg  updated_AR;

wire branch_predict;
reg  check_condition;

// NOTE: acc is declared as an unsigned reg, so the relational comparisons with 0
// below are evaluated as unsigned unless acc is cast with $signed().
assign branch_predict = (IR4[15:0]==BZ)   ? ((acc==0)    ? 1:0) :
                        (IR4[15:0]==BV)   ? ((SR[15]==0) ? 1:0) :
                        (IR4[15:0]==BNZ)  ? ((acc!=0)    ? 1:0) :
                        (IR4[15:0]==BLZ)  ? ((acc< 0)    ? 1:0) :
                        (IR4[15:0]==BLEZ) ? ((acc<=0)    ? 1:0) :
                        (IR4[15:0]==BGEZ) ? ((acc>=0)    ? 1:0) :
                        (IR4[15:0]==BGZ)  ? ((acc> 0)    ? 1:0) :
                        ((IR4[15:8]==BANZ) && (IR4[6:0]==0)) ? ((AR[ARP]==0) ? 1:0) :
                        0;

assign DM_Addr = (get_DMAddr) ? (IR2[7] == 1 ? AR[ARP][14:0] : {DPPTR, IR2[6:0]}) :
                 (updated_AR) ? AR[ARP][14:0] :
                 DM_Addr_reg;

assign mult_2 = ((IR4[15:8]==MAC) || (IR4[15:8]==MPY)) ? buff_mult_2 :
                (IR4[15:8]==MPYK) ? {8'd0, IR4[7:0]} :
                0;

assign DM_in = (IR4[15:13] == SAC) ? acc :
               (IR4[15:11] == SAR) ? AR[IR4[10:8]] :
               32'h0;

assign s1_in = ((IR4[15:8]==MAC) || (IR4[15:0]==SPAC) || (IR4[15:0]==APAC)) ? Preg :
               ((IR4[15:8]==OR) || (IR4[15:8]==XOR) || (IR4[15:8]==SUBS) || (IR4[15:8]==AND) || (IR4[15:8]==SUBSIMD) || (IR4[15:8]==ADDSIMD) || (IR4[15:9]==CMPSIMD)
               || (IR4[15:8]==LTA) || (IR4[15:8]==LTS) || (IR4[15:13]==SUB) || (






//-- 5 Instantiation of components
//-----------------------------------------------------------
multiplier m1 (.scan_in0(scan_in0),
               .scan_out0(scan_out0),
               .scan_en(scan_en),
               .test_mode(test_mode),
               .a(Treg),
               .b(mult_2),
               .ov(SR[15]),
               .product(Preg_wire));

// Shifts input operand
shifter_input s1 (.scan_in0(scan_in0),
                  .scan_out0(scan_out0),
                  .scan_en(scan_en),
                  .test_mode(test_mode),
                  .shift_in(s1_in),
                  .opcode(alu_opcode),
                  .shift_out(s1_out));

ALU alu1 (.scan_in0(scan_in0),
          .scan_out0(scan_out0),
          .scan_en(scan_en),
          .test_mode(test_mode),
          //
          .a(acc),
          .b(s1_out),
          .opcode(alu_opcode),
          .result(alu_out),
          .carry(SR_wire[15]),
          .negative(SR_wire[14]),
          .ov(SR_wire[13]),
          .zero(SR_wire[12]),
          .carry_2(SR_wire[11]),
          .negative_2(SR_wire[10]),
          .ov_2(SR_wire[9]),
          .zero_2(SR_wire[8]),
          .carry_3(SR_wire[7]),
          .negative_3(SR_wire[6]),
          .ov_3(SR_wire[5]),
          .zero_3(SR_wire[4]),
          .carry_4(SR_wire[3]),
          .negative_4(SR_wire[2]),
          .ov_4(SR_wire[1]),
          .zero_4(SR_wire[0])
          );

// Shifts output operand
shifter_output s2 (.scan_in0(scan_in0),
                   .scan_out0(scan_out0),
                   .scan_en(scan_en),
                   .test_mode(test_mode),
                   .shift_in(alu_out),
                   .opcode(next_opcode),
                   .shift_out(result));
//-----------------------------------------------------------
//-- 6 Code starts here....
//-----------------------------------------------------------
always @(posedge clk or posedge reset)
begin
    if (reset)
    begin : reset_values
        integer i; // loop index used only to clear the register arrays
        PC <= 16'h0;
        for (i = 0; i < 8; i = i + 1) AR[i] <= 16'h0;
        stall_mc1 <= 0; stall_mc2 <= 1; stall_mc3 <= 1; stall_mc4 <= 1;
        IR2 <= 16'h0; IR3 <= 16'h0; IR4 <= 16'h0; DPPTR <= 8'd0;
        wr_data <= 1'b1; // Read mode
        ARP <= 3'd0;
        SP <= 5'd0;
        CSP <= 5'd0;
        CEN <= 1'b0; OEN <= 1'b0;
        JFlag <= 1'b0; JFlag_uc <= 1'b0; JFlag_c <= 1'b0;
        Treg <= 16'h0;
        Preg <= 32'h0;
        acc <= 32'h0;
        breg <= 32'h0;
        alu_opcode <= 16'h0;
        mpy_opcode <= 16'h0;
        next_opcode <= 16'h0;
        DM_Addr_reg <= 15'h0;
        ////////////////////
        updated_AR <= 0;
        ///////
        buff <= 32'h0;
        buff_mult_2 <= 32'h0;
        stall_uc <= 1'h0;
        cnt <= 0;
        JAddr <= 16'h0; JAddr2 <= 16'h0; JAddr3 <= 16'h0; JAddr4 <= 16'h0;
        //////
        // Clear all 32 entries of the stack, the call stack and the call SR registers
        for (i = 0; i < 32; i = i + 1) begin
            stack[i]      <= 32'h0;
            call_stack[i] <= 16'h0;
            call_SR[i]    <= 16'h0;
        end
        //////
    end
    else
    begin
        DM_Addr_reg <= DM_Addr;
        // Fetch Data memory Address in Fetch stage or pipeline stage 2
        if ((PM_out[15:8]==MPY) || (PM_out[15:8]==MAC) || (PM_out[15:8]==OR) || (PM_out[15:8]==XOR) || (PM_out[15:8]==SUBS) || (PM_out[15:8]==ADDS) || (PM_out[15:8]==AND)
            || (PM_out[15:8]==LDP) || (PM_out[15:8]==LT) || (PM_out[15:8]==LTA) || (PM_out[15:8]==LTP) || (PM_out[15:8]==LTS) || (PM_out[15:8]==MAR) || (PM_out[15:8]==ZALH)
            || (PM_out[15:8]==ZALS) || (PM_out[15:8]==SUBSIMD) || (PM_out[15:8]==ADDSIMD) || (PM_out[15:9]==CMPSIMD) || (PM_out[15:11]==LAR) || (PM_out[15:11]==SAR)
            || (PM_out[15:13]==SUB) || (PM_out[15:13]==ADD) || (PM_out[15:13]==LAC) || (PM_out[15:13]==SAC)) get_DMAddr <= 1;
        else get_DMAddr <= 0;

        if (updated_AR == 1) updated_AR <= 0;





        if ((wr_data==0) && (IR2[15:13] != SAC) && (IR2[15:13] != SAR)) wr_data <= 1'b1; // Read mode
        //////////////////////////////////////////////////////////////////////////////
        //-- 6.1 Pipeline stage 4
        if (stall_mc4 == 0)
        begin
            case (IR4[15:13])
                ADD, SUB, LAC: begin
                    acc <= result;
                    SR <= SR_wire;
                end
            endcase
            case (IR4[15:11])
                LAR: AR[IR4[10:8]] <= buff[15:0];
            endcase
            case (IR4[15:9])
                CMPSIMD: begin
                    SR <= SR_wire;
                    acc <= result;
                end
            endcase




                    acc <= result;
                end
                MAC: begin
                    Preg <= Preg_wire;
                    SR <= SR_wire;
                    acc <= result;
                end
                MPY, MPYK: begin
                    Preg <= Preg_wire;
                end
                LACK: begin
                    acc <= IR4[7:0];
                end
                LT: begin
                    if (SR[13]==0) Treg[15:0] <= buff[15:0]; // if overflow is reset, then number is considered positive
                    else begin                               // if overflow is set, then number is considered negative
                        Treg[15] <= buff[31];
                        Treg[14:0] <= buff[14:0];
                    end
                end
                LTA, LTS: begin
                    if (SR[13]==0) Treg[15:0] <= buff[15:0]; // if overflow is reset, then number is considered positive
                    else begin                               // if overflow is set, then number is considered negative
                        Treg[15] <= buff[31];
                        Treg[14:0] <= buff




                    acc <= Preg;
                    if (SR[13]==0) Treg[15:0] <= buff[15:0]; // if overflow is reset, then number is considered positive
                    else begin                               // if overflow is set, then number is considered negative
                        Treg[15] <= buff[31];
                        Treg[14:0] <= buff[14:0];




                    acc <= {16'd0, DM_out[15:0]};
                end
                ZALS: begin
                    acc <= {DM_out[15:0], 16'd0};
                end
                BANZ: begin
                    if (AR[ARP] == 0) JFlag <= 0;
                    else begin
                        JFlag <= 1'b1;
                        JFlag_c <= 1'b1;
                        JAddr <= JAddr4;
                    end
                end
                LARP: ARP <= IR4[2:0];
                LDPK: DPPTR <= IR4[7:0];
                LDP:  DPPTR <= DM_out[7:0];
            endcase
            case (IR4[15:0])
                APAC, SPAC: begin
                    acc <= result;
                    SR <= SR_wire;
                end
                BGEZ: begin
                    if (acc >= 0) begin
                        JFlag_c <= 1'b1;




                    if (acc > 0) begin
                        JFlag_c <= 1'b1;




                    if (acc <= 0) begin
                        JAddr <= JAddr4;




                    if (acc < 0) begin
                        JAddr <= JAddr4;
                        JFlag_c <= 1'b1;




                    if (acc != 0) begin
                        JAddr <= JAddr4;




                    if (SR[13] == 1) begin
                        SR[13] <= 0;
                        JFlag_c <= 1'b1;




                    if (acc == 0) begin
                        JFlag_c <= 1'b1;
                        JAddr <= JAddr4;
                    end
                end
                ROVM: SR[13] <= 0;
                SOVM: SR[13] <= 1;
                ZAC:  acc <= 32'h0;
                PAC:  acc <= Preg;
                PUSH: begin
                    stack[SP] <= acc;
                    SP <= SP + 1'b1;
                end
                POP: begin
                    acc <= stack[SP-1'b1];
                    SP <= SP - 1'b1;
                end
                CALL: begin
                    call_SR[CSP] <= SR;
                    CSP <= CSP + 1'b1;
                end
                RET: begin






        //-- 6.2 Pipeline stage 3
        if (stall_mc3 == 0)
        begin
            case (IR3[15:0])
                APAC, SPAC: begin
                    alu_opcode <= IR3;
                    s1_in_reg <= Preg;
                end
            endcase
            case (IR3[15:9])
                CMPSIMD: begin
                    alu_opcode <= IR3;
                    s1_in_reg <= DM_out;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














            case (IR3[15:8])
                OR, XOR, SUBS, ADDS, AND, SUBSIMD, ADDSIMD: begin
                    alu_opcode <= IR3;
                    s1_in_reg <= DM_out;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];













                LTA, LTS: begin
                    buff <= DM_out;
                    alu_opcode <= IR3;
                    s1_in_reg <= Preg;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














                    buff <= DM_out;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];









                    ARP <= IR3[2:0];





                    mpy_opcode <= IR3;
                    buff_mult_2 <= DM_out;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];













                MPYK: mpy_opcode <= IR3;
                MAC: begin
                    mpy_opcode <= IR3;
                    alu_opcode <= IR3;
                    s1_in_reg <= Preg;
                    buff_mult_2 <= DM_out;
                    updated_AR <= 1;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














            case (IR3[15:13])
                SUB, ADD, LAC: begin
                    alu_opcode <= IR3;
                    updated_AR <= 1;
                    s1_in_reg <= buff;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        buff <= DM_out;
                        s1_in_reg <= DM_out;
                        PARP <= ARP;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














                    alu_opcode <= IR3;
                    s1_in_reg <= DM_out;
                    updated_AR <= 1;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














            case (IR3[15:11])
                LAR: begin
                    buff <= DM_out;
                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];














                    if (IR3[7] == 1) begin
                        PAR[ARP] <= AR[ARP];
                        PARP <= ARP;
                        updated_AR <= 1;
                        if (JFlag_c == 1)
                        begin
                            ARP <= PARP;
                            AR[PARP] <= PAR[PARP];
                        end
                        else
                        begin
                            case ({IR3[6], IR3[5]})
                                2'b00: AR[ARP] <= AR[ARP];













                LARK:  AR[IR3[10:8]][7:0]  <= IR3[7:0];
                LARKH: AR[IR3[10:8]][15:8] <= IR3[7:0];
            endcase
        end
        //-- 6.3 Pipeline stage 2
        if (stall_mc2 == 0)
        begin
            case (IR2[15:9])
                CMPSIMD: begin
                    wr_data <= 1'b1; // For read
                end
            endcase
            case (IR2[15:8])
                BANZ: begin




                    wr_data <= 1'b1; // For read
                end
            endcase
            case (IR2[15:11])
                LAR: begin
                    wr_data <= 1'b1; // For read
                end
                SAR: begin
                    next_opcode <= IR2;
                    wr_data <= 1'b0; // For write
                end
            endcase
            case (IR2[15:13])
                SUB, ADD, LAC: begin
                    wr_data <= 1'b1; // For read
                    if (IR2[7] == 0) buff = DM_out;
                end
                SAC: begin
                    next_opcode <= IR2;
                    wr_data <= 1'b0; // For write
                end
            endcase
            case (IR2[15:0])
                BGEZ, BGZ, BLEZ, BLZ, BNZ, BV, BZ: begin
                    if (branch_predict == 0) JAddr <= PM_out;
                    else begin JAddr2 <= PM_out; end
                end
                BU: begin
                    if (branch_predict == 0) JAddr <= PM_out;




                    call_stack[CSP] <= PC;
                    if (branch_predict == 0) JAddr <= PM_out;




                    CSP <= CSP - 1;
                    if (branch_predict == 0) JAddr <= call_stack[CSP-1][14:0];
                    else begin







        if (stall_mc3 == 0)
        begin
            if (IR4[15:0]==BU   ||
                IR4[15:0]==BGEZ ||
                IR4[15:0]==BGZ  ||
                IR4[15:0]==BLEZ ||
                IR4[15:0]==BLZ  ||
                IR4[15:0]==BNZ  ||
                IR4[15:0]==BV   ||
                IR4[15:0]==BZ   ||
                IR4[15:0]==CALL ||
                IR4[15:0]==RET  ||
                IR4[15:8]==BANZ) begin
                stall_mc3 <= 1'b1; IR4 <= 16'hffff;
            end
            else if (JFlag_c == 1) begin
                stall_mc3 <= 1; IR4 <= 16'hffff;
            end
            else begin
                stall_mc4 <= 0;
                IR4 <= IR3;
            end




if (stall_mc2 == 0)
begin
    if (IR3[15:0] == BU   ||
        IR3[15:0] == BGEZ ||
        IR3[15:0] == BGZ  ||
        IR3[15:0] == BLEZ ||
        IR3[15:0] == BLZ  ||
        IR3[15:0] == BNZ  ||
        IR3[15:0] == BV   ||
        IR3[15:0] == BZ   ||
        IR3[15:0] == CALL ||
        IR3[15:0] == RET  ||
        IR3[15:8] == BANZ) begin stall_mc2 <= 1; IR3 <= 16'hffff; end
    else if (JFlag_c == 1) begin
        stall_mc2 <= 1; IR3 <= 16'hffff;
    end
    else begin stall_mc3 <= 0; IR3 <= IR2; end




if (stall_mc1 == 0)
begin
    if (check_condition) begin
        JAddr <= JAddr2;
        check_condition <= 0;
    end

    if (branch_predict)                                  PC <= JAddr;
    else if ((IR2[13:0] == BU) || (IR2[15:0] == CALL))   PC <= PM_out;
    else if (IR2[15:0] == RET)                           PC <= call_stack[CSP-1][14:0] + 2;
    else                                                 PC <= PC + 1'b1;

    if (IR2[15:0] == BU   ||
        IR2[15:0] == BGEZ ||
        IR2[15:0] == BGZ  ||
        IR2[15:0] == BLEZ ||
        IR2[15:0] == BLZ  ||
        IR2[15:0] == BNZ  ||
        IR2[15:0] == BZ   ||
        IR2[15:0] == BV   ||
        IR2[15:0] == CALL ||
        IR2[15:0] == RET  ||
        IR2[15:8] == BANZ) IR2 <= 16'hffff;
    else begin

















// Author : Shashank Simha
// Date : 12/12/2017
// University : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
//
////////////////////////////////////////////////////////////////////////////
module ALU (
    reset,








    result,
    ov,
    carry,













    zero_4);

input
    reset,      // system reset
    clk;        // system clock

input
    scan_in0,   // test scan mode data input
    scan_en,    // test scan mode enable
    test_mode;  // test mode select

output
    scan_out0;  // test scan mode data output

input  [31:0] a, b;
input  [15:0] opcode;
output [31:0] result;
output ov, carry, negative, zero, ov_2, carry_2, negative_2, zero_2, ov_3, carry_3,
       negative_3, zero_3, ov_4, carry_4, negative_4, zero_4;
//-----------------------------------------------------------
//-- 1 ISA Parameters
//-----------------------------------------------------------
parameter [7:0]  MPY     = 8'b00000000; // 1.
parameter [7:0]  MPYK    = 8'b10100000; // 2.
parameter [7:0]  MAC     = 8'h02;       // 3.
parameter [7:0]  OR      = 8'h03;       // 4.
parameter [7:0]  XOR     = 8'h04;       // 5.
parameter [15:0] SPAC    = 16'h0100;    // 6.
parameter [2:0]  SUB     = 3'h1;        // 7.
parameter [7:0]  SUBS    = 8'h05;       // 8.
parameter [2:0]  ADD     = 3'h2;        // 9.
parameter [7:0]  ADDS    = 8'h06;       // 10.
parameter [7:0]  AND     = 8'h07;       // 11.
parameter [15:0] BU      = 16'h010F;    // 12.
parameter [15:0] BGEZ    = 16'h0101;    // 14.
parameter [15:0] BGZ     = 16'h0102;    // 15.
parameter [15:0] BLEZ    = 16'h0103;    // 16.
parameter [15:0] BLZ     = 16'h0104;    // 17.
parameter [15:0] BNZ     = 16'h0105;    // 18.
parameter [15:0] BV      = 16'h0106;    // 19.
parameter [15:0] BZ      = 16'h0107;    // 20.
parameter [2:0]  LAC     = 3'h3;        // 21.
parameter [7:0]  LACK    = 8'b10100001; // 22.
parameter [7:0]  LAR     = 5'b11000;    // 23.
parameter [7:0]  LARK    = 5'b11001;    // 24.
parameter [7:0]  LARKH   = 5'b11011;    // 25.
parameter [7:0]  LARP    = 8'b10100011; // 26.
parameter [7:0]  LDP     = 8'h09;       // 27.
parameter [7:0]  LDPK    = 8'b10100100; // 28.
parameter [7:0]  LT      = 8'h0a;       // 29.
parameter [7:0]  LTA     = 8'h0b;       // 30.
parameter [7:0]  LTD     = 8'h0c;       // 31.
parameter [7:0]  LTP     = 8'h0d;       // 32.
parameter [7:0]  LTS     = 8'h0e;       // 33.
parameter [7:0]  MAR     = 8'h0f;       // 34.
parameter [15:0] PAC     = 16'h011F;    // 35.
parameter [15:0] ROVM    = 16'h012F;    // 36.
parameter [2:0]  SAC     = 3'h4;        // 37.
parameter [7:0]  SAR     = 5'b11010;    // 38.
parameter [15:0] NOP     = 16'h014F;    // 42.
parameter [15:0] ZAC     = 16'h015F;    // 43.
parameter [7:0]  ZALH    = 8'h13;       // 44.
parameter [7:0]  ZALS    = 8'h14;       // 45.
parameter [15:0] APAC    = 16'h016F;    // 46.
parameter [6:0]  CMPSIMD = 7'b0001101;  // 47.
parameter [7:0]  SUBSIMD = 8'h16;       // 48.
parameter [7:0]  ADDSIMD = 8'h17;       // 49.
parameter [8:0]  BANZ    = 8'h18;       // 13.
parameter [15:0] PUSH    = 16'h017F;    // 50.
parameter [15:0] POP     = 16'h018F;    // 51.
parameter [15:0] CALL    = 16'h01AF;    // 52.
parameter [15:0] RET     = 16'h019F;    // 53.

wire [31:0] out_comp;
wire        comp_en;
wire [31:0] b_sub;
wire [7:0]  a1, a2, a3, a4;
wire [7:0]  b1, b2, b3, b4;
wire [3:0]  Cin;
wire [3:0]  Cout;
wire [7:0]  S1, S2, S3, S4;

assign ov = ((opcode[15:13]==ADD) || (opcode[15:8]==ADDS) || (opcode[15:0]==APAC) ||
             (opcode[15:8]==LTA) || (opcode[15:8]==MAC) || (opcode[15:8]==ADDSIMD))
                ? ((a[31] ~^ b[31]) && result[31]) :
            ((opcode[15:0]==SPAC) || (opcode[15:8]==LTS) || (opcode[15:8]==SUBSIMD))
                ? ((a[31] ^ b[31]) && result[31])
                : 0;
assign ov_2 = (opcode[15:8]==ADDSIMD) ? ((a[23] ~^ b[23]) && result[23]) :
              (opcode[15:8]==SUBSIMD) ? ((a[23] ^ b[23]) && result[23]) : 0;
assign ov_3 = (opcode[15:8]==ADDSIMD) ? ((a[15] ~^ b[15]) && result[15]) :
              (opcode[15:8]==SUBSIMD) ? ((a[15] ^ b[15]) && result[15]) : 0;
assign ov_4 = (opcode[15:8]==ADDSIMD) ? ((a[7] ~^ b[7]) && result[7]) :
              (opcode[15:8]==SUBSIMD) ? ((a[7] ^ b[7]) && result[7]) : 0;

assign carry   = Cout[0];
assign carry_2 = Cout[1];
assign carry_3 = Cout[2];
assign carry_4 = Cout[3];

assign negative   = result[31];
assign negative_2 = result[23];
assign negative_3 = result[15];
assign negative_4 = result[7];

assign zero   = ((opcode[15:8]==ADDSIMD) || (opcode[15:8]==SUBSIMD))
                    ? ((result[31:24] == 0) ? 1 : 0) : (result==0) ? 1 : 0;
assign zero_2 = (result[23:16]==0) ? 1 : 0;
assign zero_3 = (result[15:8]==0) ? 1 : 0;
assign zero_4 = (result[7:0]==0) ? 1 : 0;

assign result = ((opcode[15:13] == ADD) || (opcode[15:8] == ADDS) || (opcode[15:0] == APAC) ||
                 (opcode[15:8] == MAC) || (opcode[15:8] == LTA) || (opcode[15:13] == SUB) ||
                 (opcode[15:0] == SPAC) || (opcode[15:8] == SUBS) || (opcode[15:8] == LTS) ||
                 (opcode[15:8] == SUBSIMD) || (opcode[15:8] == ADDSIMD)) ? {S4, S3, S2, S1} :
                (opcode[15:9] == CMPSIMD) ? out_comp :
                (opcode[15:8] == OR)      ? (a | b) :
                (opcode[15:8] == AND)     ? (a & b) :
                (opcode[15:8] == XOR)     ? (a ^ b) :
                ((opcode[15:13] == SAC) || (opcode[15:13] == LAC))




assign comp_en = (opcode[15:9] == CMPSIMD) ? 1 : 0;

assign a4 = a[31:24];
assign a3 = a[23:16];
assign a2 = a[15:8];
assign a1 = a[7:0];

assign b_sub = (opcode[15:8] == SUBSIMD)
                   ? {(~b[31:24])+1, (~b[23:16])+1, (~b[15:8])+1, (~b[7:0])+1}
                   : (~b) + 1; // 2's complement for subtraction

assign b4 = ((opcode[15:13] == SUB) || (opcode[15:8] == SUBS) || (opcode[15:8] == SUBSIMD) ||
             (opcode[15:0] == SPAC) || (opcode[15:8] == LTS)) ? b_sub[31:24] : b[31:24];
assign b3 = ((opcode[15:13] == SUB) || (opcode[15:8] == SUBS) || (opcode[15:8] == SUBSIMD) ||
             (opcode[15:0] == SPAC) || (opcode[15:8] == LTS)) ? b_sub[23:16] : b[23:16];
assign b2 = ((opcode[15:13] == SUB) || (opcode[15:8] == SUBS) || (opcode[15:8] == SUBSIMD) ||
             (opcode[15:0] == SPAC) || (opcode[15:8] == LTS)) ? b_sub[15:8] : b[15:8];
assign b1 = ((opcode[15:13] == SUB) || (opcode[15:8] == SUBS) || (opcode[15:8] == SUBSIMD) ||
             (opcode[15:0] == SPAC) || (opcode[15:8] == LTS)) ? b_sub[7:0] : b[7:0];


assign Cin[0] = 0;
assign Cin[1] = ((opcode[15:8] == ADDSIMD) || (opcode[15:8] == SUBSIMD)) ? 0 : Cout[0];
assign Cin[2] = ((opcode[15:8] == ADDSIMD) || (opcode[15:8] == SUBSIMD)) ? 0 : Cout[1];
assign Cin[3] = ((opcode[15:8] == ADDSIMD) || (opcode[15:8] == SUBSIMD)) ? 0 : Cout[2];


adder A1 (.A         (a1),
          .B         (b1),
          .Cin       (Cin[0]),
          .Cout      (Cout[0]),
          .Sum       (S1),
          .scan_en   (scan_en),
          .scan_in0  (scan_in0),
          .test_mode (test_mode),
          .scan_out0 (scan_out0)
         );

adder A2 (.A         (a2),
          .B         (b2),
          .Cin       (Cin[1]),
          .Cout      (Cout[1]),
          .Sum       (S2),
          .scan_en   (scan_en),
          .scan_in0  (scan_in0),
          .test_mode (test_mode),
          .scan_out0 (scan_out0)
         );

adder A3 (.A         (a3),
          .B         (b3),
          .Cin       (Cin[2]),
          .Cout      (Cout[2]),
          .Sum       (S3),
          .scan_en   (scan_en),
          .scan_in0  (scan_in0),
          .test_mode (test_mode),
          .scan_out0 (scan_out0)
         );

adder A4 (.A         (a4),
          .B         (b4),
          .Cin       (Cin[3]),
          .Cout      (Cout[3]),
          .Sum       (S4),
          .scan_en   (scan_en),
          .scan_in0  (scan_in0),
          .test_mode (test_mode),
          .scan_out0 (scan_out0)
         );

compare_select comp_4 (.scan_in0  (scan_in0),
                       .scan_en   (scan_en),
                       .test_mode (test_mode),
                       .scan_out0 (scan_out0),
                       .A         (a[31:24]),
                       .B         (b[31:24]),
                       .flag      (opcode[8]),
                       .C         (out_comp[31:24]),
                       .en        (comp_en)
                      );
compare_select comp_3 (.scan_in0  (scan_in0),
                       .scan_en   (scan_en),
                       .test_mode (test_mode),
                       .scan_out0 (scan_out0),
                       .A         (a[23:16]),
                       .B         (b[23:16]),
                       .flag      (opcode[8]),
                       .C         (out_comp[23:16]),
                       .en        (comp_en)
                      );
compare_select comp_2 (.scan_in0  (scan_in0),
                       .scan_en   (scan_en),
                       .test_mode (test_mode),
                       .scan_out0 (scan_out0),
                       .A         (a[15:8]),
                       .B         (b[15:8]),
                       .flag      (opcode[8]),
                       .C         (out_comp[15:8]),
                       .en        (comp_en)
                      );
compare_select comp_1 (.scan_in0  (scan_in0),
                       .scan_en   (scan_en),
                       .test_mode (test_mode),
                       .scan_out0 (scan_out0),
                       .A         (a[7:0]),
                       .B         (b[7:0]),
                       .flag      (opcode[8]),
                       .C         (out_comp[7:0]),
                       .en        (comp_en)
                      );

endmodule
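The carry gating between the four 8-bit adders in the ALU listing above can be cross-checked with a small behavioral model. The Python sketch below is not part of the original design: the names `lanes` and `alu_addsub` are hypothetical, and it assumes that each lane's two's complement wraps to 8 bits in SIMD mode. It mirrors the `Cin`/`Cout` wiring, so that a carry ripples between lanes for ordinary 32-bit operations and is forced to zero for the SIMD ones.

```python
MASK8 = 0xFF
MASK32 = 0xFFFFFFFF


def lanes(x):
    """Split a 32-bit word into four 8-bit lanes, least significant lane first."""
    return [(x >> (8 * i)) & MASK8 for i in range(4)]


def alu_addsub(a, b, subtract=False, simd=False):
    """Behavioral model of four chained 8-bit adders (illustrative only).

    Returns (result, couts), where couts[i] is the carry out of lane i.
    """
    a_l = lanes(a)
    if subtract:
        if simd:
            # Assumption: per-lane two's complement, each wrapped to 8 bits.
            b_l = [(-v) & MASK8 for v in lanes(b)]
        else:
            # Full 32-bit two's complement, then split into lanes.
            b_l = lanes((-b) & MASK32)
    else:
        b_l = lanes(b)

    result, couts, cin = 0, [], 0  # carry into lane 0 is always 0
    for i in range(4):
        total = a_l[i] + b_l[i] + cin
        s, cout = total & MASK8, total >> 8
        result |= s << (8 * i)
        couts.append(cout)
        # SIMD mode breaks the carry chain; otherwise the carry ripples on.
        cin = 0 if simd else cout
    return result, couts


if __name__ == "__main__":
    print(hex(alu_addsub(0x00FF00FF, 0x00010001)[0]))             # scalar add
    print(hex(alu_addsub(0x00FF00FF, 0x00010001, simd=True)[0]))  # lane-wise add
```

In scalar mode the two overflowing lanes carry into their neighbours, while in SIMD mode each lane wraps independently, which is the behaviour the `Cin[1]` to `Cin[3]` multiplexers select.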




// Author : Shashank Simha
// Date : 12/12/2017
// University : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
//
////////////////////////////////////////////////////////////////////////////




    shift_in,
    opcode,
    shift_out);
//-- 1 ISA Parameters
//-----------------------------------------------------------
parameter [7:0]  MPY     = 8'b00000000; // 1.
parameter [7:0]  MPYK    = 8'b10100000; // 2.
parameter [7:0]  MAC     = 8'h02;       // 3.
parameter [7:0]  OR      = 8'h03;       // 4.
parameter [7:0]  XOR     = 8'h04;       // 5.
parameter [15:0] SPAC    = 16'h0100;    // 6.
parameter [2:0]  SUB     = 3'h1;        // 7.
parameter [7:0]  SUBS    = 8'h05;       // 8.
parameter [2:0]  ADD     = 3'h2;        // 9.
parameter [7:0]  ADDS    = 8'h06;       // 10.
parameter [7:0]  AND     = 8'h07;       // 11.
parameter [15:0] BU      = 16'h010F;    // 12.
parameter [8:0]  BANZ    = 8'h01;       // 13.
parameter [15:0] BGEZ    = 16'h0101;    // 14.
parameter [15:0] BGZ     = 16'h0102;    // 15.
parameter [15:0] BLEZ    = 16'h0103;    // 16.
parameter [15:0] BLZ     = 16'h0104;    // 17.
parameter [15:0] BNZ     = 16'h0105;    // 18.
parameter [15:0] BV      = 16'h0106;    // 19.
parameter [15:0] BZ      = 16'h0107;    // 20.
parameter [2:0]  LAC     = 3'h3;        // 21.
parameter [7:0]  LACK    = 8'b10100001; // 22.
parameter [7:0]  LAR     = 5'b11000;    // 23.
parameter [7:0]  LARK    = 5'b11001;    // 24.
parameter [7:0]  LARKH   = 5'b11011;    // 25.
parameter [7:0]  LARP    = 8'b10100011; // 26.
parameter [7:0]  LDP     = 8'h09;       // 27.
parameter [7:0]  LDPK    = 8'b10100100; // 28.
parameter [7:0]  LT      = 8'h0a;       // 29.
parameter [7:0]  LTA     = 8'h0b;       // 30.
parameter [7:0]  LTD     = 8'h0c;       // 31.
parameter [7:0]  LTP     = 8'h0d;       // 32.
parameter [7:0]  LTS     = 8'h0e;       // 33.
parameter [7:0]  MAR     = 8'h0f;       // 34.
parameter [15:0] PAC     = 16'h011F;    // 35.
parameter [15:0] ROVM    = 16'h012F;    // 36.
parameter [2:0]  SAC     = 3'h4;        // 37.
parameter [7:0]  SAR     = 5'b11010;    // 38.
parameter [15:0] SOVM    = 16'h013F;    // 39.
parameter [7:0]  TBLR    = 8'h11;       // 40.
parameter [7:0]  TBLW    = 8'h12;       // 41.
parameter [15:0] NOP     = 16'h014F;    // 42.
parameter [15:0] ZAC     = 16'h015F;    // 43.
parameter [7:0]  ZALH    = 8'h13;       // 44.
parameter [7:0]  ZALS    = 8'h14;       // 45.
parameter [15:0] APAC    = 16'h016F;    // 46.
parameter [7:0]  CMPSIMD = 8'h15;       // 47.
parameter [7:0]  SUBSIMD = 8'h16;       // 48.
parameter [7:0]  ADDSIMD = 8'h17;       // 49.
parameter [15:0] PUSH    = 16'h017F;    // 50.
parameter [15:0] POP     = 16'h018F;    // 51.
parameter [15:0] CALL    = 16'h01AF;    // 52.
parameter [15:0] RET     = 16'h019F;    // 53.




output scan_out0;

input [31:0] shift_in;
input [15:0] opcode;

output [31:0] shift_out;

// Shifts operand for ADD and SUB
assign shift_out = ((opcode[15:13]==ADD) || (opcode[15:13]==SUB)) ?
    ((opcode[12:8]==5'b00000) ? shift_in :
     (opcode[12:8]==5'b00001) ? {1'h0,  shift_in[31:1]}  :
     (opcode[12:8]==5'b00010) ? {2'h0,  shift_in[31:2]}  :
     (opcode[12:8]==5'b00011) ? {3'h0,  shift_in[31:3]}  :
     (opcode[12:8]==5'b00100) ? {4'h0,  shift_in[31:4]}  :
     (opcode[12:8]==5'b00101) ? {5'h0,  shift_in[31:5]}  :
     (opcode[12:8]==5'b00110) ? {6'h0,  shift_in[31:6]}  :
     (opcode[12:8]==5'b00111) ? {7'h0,  shift_in[31:7]}  :
     (opcode[12:8]==5'b01000) ? {8'h0,  shift_in[31:8]}  :
     (opcode[12:8]==5'b01001) ? {9'h0,  shift_in[31:9]}  :
     (opcode[12:8]==5'b01010) ? {10'h0, shift_in[31:10]} :
     (opcode[12:8]==5'b01011) ? {11'h0, shift_in[31:11]} :
     (opcode[12:8]==5'b01100) ? {12'h0, shift_in[31:12]} :
     (opcode[12:8]==5'b01101) ? {13'h0, shift_in[31:13]} :
     (opcode[12:8]==5'b01110) ? {14'h0, shift_in[31:14]} :
     (opcode[12:8]==5'b01111) ? {15'h0, shift_in[31:15]} :
     (opcode[12:8]==5'b10000) ? {16'h0, shift_in[31:16]} :
     (opcode[12:8]==5'b10001) ? {17'h0, shift_in[31:17]} :
     (opcode[12:8]==5'b10010) ? {18'h0, shift_in[31:18]} :
     (opcode[12:8]==5'b10011) ? {19'h0, shift_in[31:19]} :
     (opcode[12:8]==5'b10100) ? {20'h0, shift_in[31:20]} :
     (opcode[12:8]==5'b10101) ? {21'h0, shift_in[31:21]} :
     (opcode[12:8]==5'b10110) ? {22'h0, shift_in[31:22]} :
     (opcode[12:8]==5'b10111) ? {23'h0, shift_in[31:23]} :
     (opcode[12:8]==5'b11000) ? {24'h0, shift_in[31:24]} :
     (opcode[12:8]==5'b11001) ? {25'h0, shift_in[31:25]} :
     (opcode[12:8]==5'b11010) ? {26'h0, shift_in[31:26]} :
     (opcode[12:8]==5'b11011) ? {27'h0, shift_in[31:27]} :
     (opcode[12:8]==5'b11100) ? {28'h0, shift_in[31:28]} :
     (opcode[12:8]==5'b11101) ? {29'h0, shift_in[31:29]} :
     (opcode[12:8]==5'b11110) ? {30'h0, shift_in[31:30]} :
     (opcode[12:8]==5'b11111) ? {31'h0, shift_in[31]}    :
     32'h0) :
    shift_in;

endmodule
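The 32-way conditional chain above amounts to a logical right shift by the 5-bit field `opcode[12:8]`, applied only when the top three opcode bits select one of the enabled instructions. A minimal Python sketch of that behaviour follows, as a reference model for testbench comparison. The function name `operand_shift` and its `enabled` parameter are hypothetical; the default encodings are the `ADD` and `SUB` values from the parameter list.

```python
ADD = 0b010  # 3-bit opcode values taken from the ISA parameter list
SUB = 0b001


def operand_shift(shift_in, opcode, enabled=(ADD, SUB)):
    """Zero-filling right shift of a 32-bit operand by opcode[12:8].

    The shift is applied only when opcode[15:13] is one of `enabled`;
    any other opcode passes the operand through unchanged.
    """
    shift_in &= 0xFFFFFFFF
    if (opcode >> 13) & 0x7 in enabled:
        amount = (opcode >> 8) & 0x1F  # 0..31
        return shift_in >> amount
    return shift_in


if __name__ == "__main__":
    add_shift4 = (ADD << 13) | (4 << 8)  # ADD with a shift amount of 4
    print(hex(operand_shift(0x80000000, add_shift4)))
```

Because the vacated bits are filled with zeros, a negative operand is not sign-extended by this shifter.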




// Author : Shashank Simha
// Date : 12/12/2017
// University : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
//
////////////////////////////////////////////////////////////////////////////




    shift_in,
    opcode,
    shift_out);
//-- 1 ISA Parameters
//-----------------------------------------------------------
parameter [7:0]  MPY     = 8'b00000000; // 1.
parameter [7:0]  MPYK    = 8'b10100000; // 2.
parameter [7:0]  MAC     = 8'h02;       // 3.
parameter [7:0]  OR      = 8'h03;       // 4.
parameter [7:0]  XOR     = 8'h04;       // 5.
parameter [15:0] SPAC    = 16'h0100;    // 6.
parameter [2:0]  SUB     = 3'h1;        // 7.
parameter [7:0]  SUBS    = 8'h05;       // 8.
parameter [2:0]  ADD     = 3'h2;        // 9.
parameter [7:0]  ADDS    = 8'h06;       // 10.
parameter [7:0]  AND     = 8'h07;       // 11.
parameter [15:0] BU      = 16'h010F;    // 12.
parameter [8:0]  BANZ    = 8'h01;       // 13.
parameter [15:0] BGEZ    = 16'h0101;    // 14.
parameter [15:0] BGZ     = 16'h0102;    // 15.
parameter [15:0] BLEZ    = 16'h0103;    // 16.
parameter [15:0] BLZ     = 16'h0104;    // 17.
parameter [15:0] BNZ     = 16'h0105;    // 18.
parameter [15:0] BV      = 16'h0106;    // 19.
parameter [15:0] BZ      = 16'h0107;    // 20.
parameter [2:0]  LAC     = 3'h3;        // 21.
parameter [7:0]  LACK    = 8'b10100001; // 22.
parameter [7:0]  LAR     = 5'b11000;    // 23.
parameter [7:0]  LARK    = 5'b11001;    // 24.
parameter [7:0]  LARKH   = 5'b11011;    // 25.
parameter [7:0]  LARP    = 8'b10100011; // 26.
parameter [7:0]  LDP     = 8'h09;       // 27.
parameter [7:0]  LDPK    = 8'b10100100; // 28.
parameter [7:0]  LT      = 8'h0a;       // 29.
parameter [7:0]  LTA     = 8'h0b;       // 30.
parameter [7:0]  LTD     = 8'h0c;       // 31.
parameter [7:0]  LTP     = 8'h0d;       // 32.
parameter [7:0]  LTS     = 8'h0e;       // 33.
parameter [7:0]  MAR     = 8'h0f;       // 34.
parameter [15:0] PAC     = 16'h011F;    // 35.
parameter [15:0] ROVM    = 16'h012F;    // 36.
parameter [2:0]  SAC     = 3'h4;        // 37.
parameter [7:0]  SAR     = 5'b11010;    // 38.
parameter [15:0] SOVM    = 16'h013F;    // 39.
parameter [7:0]  TBLR    = 8'h11;       // 40.
parameter [7:0]  TBLW    = 8'h12;       // 41.
parameter [15:0] NOP     = 16'h014F;    // 42.
parameter [15:0] ZAC     = 16'h015F;    // 43.
parameter [7:0]  ZALH    = 8'h13;       // 44.
parameter [7:0]  ZALS    = 8'h14;       // 45.
parameter [15:0] APAC    = 16'h016F;    // 46.
parameter [7:0]  CMPSIMD = 8'h15;       // 47.
parameter [7:0]  SUBSIMD = 8'h16;       // 48.
parameter [7:0]  ADDSIMD = 8'h17;       // 49.
parameter [15:0] PUSH    = 16'h017F;    // 50.
parameter [15:0] POP     = 16'h018F;    // 51.
parameter [15:0] CALL    = 16'h01AF;    // 52.
parameter [15:0] RET     = 16'h019F;    // 53.




output scan_out0;

input [31:0] shift_in;
input [15:0] opcode;

output [31:0] shift_out;

// Shifts output for SAC
assign shift_out = ((opcode[15:13]==SAC) || (opcode[15:13]==LAC)) ?
    ((opcode[12:8]==5'b00000) ? shift_in :
     (opcode[12:8]==5'b00001) ? {1'h0,  shift_in[31:1]}  :
     (opcode[12:8]==5'b00010) ? {2'h0,  shift_in[31:2]}  :
     (opcode[12:8]==5'b00011) ? {3'h0,  shift_in[31:3]}  :
     (opcode[12:8]==5'b00100) ? {4'h0,  shift_in[31:4]}  :
     (opcode[12:8]==5'b00101) ? {5'h0,  shift_in[31:5]}  :
     (opcode[12:8]==5'b00110) ? {6'h0,  shift_in[31:6]}  :
     (opcode[12:8]==5'b00111) ? {7'h0,  shift_in[31:7]}  :
     (opcode[12:8]==5'b01000) ? {8'h0,  shift_in[31:8]}  :
     (opcode[12:8]==5'b01001) ? {9'h0,  shift_in[31:9]}  :
     (opcode[12:8]==5'b01010) ? {10'h0, shift_in[31:10]} :
     (opcode[12:8]==5'b01011) ? {11'h0, shift_in[31:11]} :
     (opcode[12:8]==5'b01100) ? {12'h0, shift_in[31:12]} :
     (opcode[12:8]==5'b01101) ? {13'h0, shift_in[31:13]} :
     (opcode[12:8]==5'b01110) ? {14'h0, shift_in[31:14]} :
     (opcode[12:8]==5'b01111) ? {15'h0, shift_in[31:15]} :
     (opcode[12:8]==5'b10000) ? {16'h0, shift_in[31:16]} :
     (opcode[12:8]==5'b10001) ? {17'h0, shift_in[31:17]} :
     (opcode[12:8]==5'b10010) ? {18'h0, shift_in[31:18]} :
     (opcode[12:8]==5'b10011) ? {19'h0, shift_in[31:19]} :
     (opcode[12:8]==5'b10100) ? {20'h0, shift_in[31:20]} :
     (opcode[12:8]==5'b10101) ? {21'h0, shift_in[31:21]} :
     (opcode[12:8]==5'b10110) ? {22'h0, shift_in[31:22]} :
     (opcode[12:8]==5'b10111) ? {23'h0, shift_in[31:23]} :
     (opcode[12:8]==5'b11000) ? {24'h0, shift_in[31:24]} :
     (opcode[12:8]==5'b11001) ? {25'h0, shift_in[31:25]} :
     (opcode[12:8]==5'b11010) ? {26'h0, shift_in[31:26]} :
     (opcode[12:8]==5'b11011) ? {27'h0, shift_in[31:27]} :
     (opcode[12:8]==5'b11100) ? {28'h0, shift_in[31:28]} :
     (opcode[12:8]==5'b11101) ? {29'h0, shift_in[31:29]} :
     (opcode[12:8]==5'b11110) ? {30'h0, shift_in[31:30]} :
     (opcode[12:8]==5'b11111) ? {31'h0, shift_in[31]}    :
     32'h0) :
    shift_in;
endmodule
I.1.5 Compare select unit
////////////////////////////////////////////////////////////////////////////
// Author : Shashank Simha
// Date : 12/12/2017
// University : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
//
////////////////////////////////////////////////////////////////////////////
module compare_select (scan_in0, scan_en, test_mode, scan_out0, A, B, flag, C, en);

input scan_in0, scan_en, test_mode;
output scan_out0;

input [7:0] A, B;
input flag, en;

output [7:0] C;




I.1.6 Multiplier
module multiplier (scan_in0, scan_out0, scan_en, test_mode, a, b, ov, product);

input scan_in0, scan_en, test_mode;
output scan_out0;
input [15:0] a, b;
output [31:0] product;
input ov;


wire [15:0] abs_a, abs_b;
wire [15:0] twos_comp_a, twos_comp_b;

parameter PSAT = 32'h7fffffff;
parameter NSAT = 32'h80000000;

wire [31:0] abs_result;

assign twos_comp_a = ((~a) + 1);
assign twos_comp_b = ((~b) + 1);

assign abs_a = a[15] ?
               (ov ? ((a == NSAT) ? PSAT : twos_comp_a) : twos_comp_a) : a;
assign abs_b = b[15] ?
               (ov ? ((b == NSAT) ? PSAT : twos_comp_b) : twos_comp_b) : b;
assign abs_result = abs_a * abs_b;
assign product = (a[15] ^ b[15]) ? ((~abs_result) + 1) : abs_result;

endmodule
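The multiplier above works in sign-magnitude form: it takes the magnitude of each 16-bit operand, multiplies the magnitudes as unsigned numbers, and negates the 32-bit product when the operand signs differ. The Python sketch below models that data path as a cross-check against ordinary signed multiplication. The names `multiply16` and `to_signed` are hypothetical, and the sketch ignores the `ov` input, so the saturation branch of the listing is not modelled.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def multiply16(a, b):
    """Sign-magnitude 16x16 -> 32-bit multiply (illustrative model).

    a and b are raw 16-bit patterns; the return value is the raw
    32-bit product pattern.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    # Magnitude: two's complement of negative operands, kept to 16 bits.
    abs_a = ((~a) + 1) & 0xFFFF if a & 0x8000 else a
    abs_b = ((~b) + 1) & 0xFFFF if b & 0x8000 else b
    abs_result = (abs_a * abs_b) & 0xFFFFFFFF
    # Negate the product when exactly one operand is negative.
    if (a ^ b) & 0x8000:
        return ((~abs_result) + 1) & 0xFFFFFFFF
    return abs_result


if __name__ == "__main__":
    for a, b in [(3, 4), (0xFFFF, 2), (0xFFFE, 0xFFFD), (0x8000, 0x8000)]:
        print(hex(a), hex(b), to_signed(multiply16(a, b), 32))
```

Keeping the magnitude of `0x8000` as the unsigned value 32768 is what lets the most negative operand square correctly to 2^30 in this model.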




// Author : Shashank Simha
// Date : 12/12/2017
// University : Rochester Institute of Technology
// Description : This is a part of the DSP implemented for grad project
//
////////////////////////////////////////////////////////////////////////////
module adder (scan_in0, scan_en, test_mode, scan_out0, A, B, Cin, Cout, Sum);

input scan_in0, scan_en, test_mode;
output scan_out0;

input [7:0] A, B;
input Cin;
output [7:0] Sum;
output Cout;

wire [7:0] G;
wire [7:0] P;
wire [7:0] C;

assign {Cout, Sum} = A + B + Cin;

endmodule
I.2 Assembler designed in Perl
##################################################################
# Author : Shashank Simha
# Date : 12/12/2017
# University : Rochester Institute of Technology
# Description : This is a part of the ASSEMBLER for DSP implemented
# for grad project
##################################################################
# use strict;
# use warnings;

my $name = "median_filter_first_try";

my $assembly_file = $name.".txt";
my $mif_file      = $name.".mif";
my $hex_file      = $name.".hex";

my @code;
my @comments;

my @line_number;
my $line_count;
my $error_count;

my @code_rearr;

my @opcode_extracted;
my @non_opcode_extracted;
my %jmp_in_label_extracted;
my %jmp_out_label_extracted;
my @jmp_labels_called;
my $j = 0;

my @opcode_generated;
my @non_opcode_generated;
my @jmp_addr_generated;

my @jmp_addr_generated_nonconverted;


my $mode;
my $dir_addr;
my $ar;
my $narp;
my $constant;
my $incr_oper;
my $shift;
#####################
my $greater_lesser;
#####################

my $jmp_label;


my %add_sub_ar = (
    "*"  => "00",
    "*+" => "01",
    "*-" => "10"
);

my %opcode_3bit = (
    "SUB" => "001",
    "ADD" => "010",
    "LAC" => "011",
    "SAC" => "100"
);

my %opcode_5bit = (
    "LARKH" => "11011",
    "LARK"  => "11001",
    "LAR"   => "11000",
    "SAR"   => "11010"
);

my %opcode_7bit = (
    "CMPSIMD" => "0001101",
);
my %opcode_8bit = (
    #KEY       => 01234567
    "MPY"      => "00000000",
    "MPYK"     => "10100000",
    "MAC"      => "00000010",
    "OR"       => "00000011",
    "XOR"      => "00000100",
    "SUBS"     => "00000101",
    "ADDS"     => "00000110",
    "AND"      => "00000111",
    "LACK"     => "10100001",
    "LARP"     => "10100011",
    "LDP"      => "00001001",
    "LDPK"     => "10100100",
    "LT"       => "00001010",
    "LTA"      => "00001011",
    "LTD"      => "00001100", # 2 cycle
    "LTP"      => "00001101",
    "LTS"      => "00001110",
    "MAR"      => "00001111",
    "SOVM"     => "00000001",
    "TBLR"     => "00010001",
    "TBLW"     => "00010010",
    "ZALH"     => "00010011",
    "ZALS"     => "00010100",
    "SUBSSIMD" => "00010110",
    "ADDSSIMD" => "00010111",
    "BANZ"     => "00011000",
);
my %opcode_16bit = (
    #KEY   => 0123456789ABCDEF
    "SPAC" => "0000000100000000",
    "BU"   => "0000000100001111",
    "BGEZ" => "0000000100000001",
    "BGZ"  => "0000000100000010",
    "BLEZ" => "0000000100000011",
    "BLZ"  => "0000000100000100",
    "BNZ"  => "0000000100000101",
    "BV"   => "0000000100000110",
    "BZ"   => "0000000100000111",
    "PAC"  => "0000000100011111",
    "ROVM" => "0000000100101111",
    "SOVM" => "0000000100111111",
    "NOP"  => "0000000101001111",
    "ZAC"  => "0000000101011111",
    "APAC" => "0000000101101111",
    "POP"  => "0000000101111111",
    "PUSH" => "0000000110001111",
    "RET"  => "0000000110011111",




open(my $in_file, '<:encoding(UTF-8)', $assembly_file) or die "\t Error: Assembly input file not found!\n";
while (<$in_file>) {
    chomp $_;
    $line_count++;
    if (($_ =~ /^(.*?);(.*)/) && !($_ =~ /^[\s]*\/\//)) {
        my $code = $1;
        $code =~ s/^\s+//;
        push(@code, $code);
        push(@comments, $2);
        push(@line_number, $line_count);
    }
    elsif (($_ =~ /^[\s]*\/\//) || ($_ =~ /^[\s]*/)) {
    }
    else {
        $error_count++;




# Machine code generation
#################################

foreach (my $i=0; $i< @code; $i= $i+1){
    $code_rearr[$j] = $code[$i];
    if ($code[$i] =~ /^(.*?)[\s]*:[\s]*(.*)/) {
        $jmp_in_label_extracted{$1} = $j;
        $code[$i] = $2;
    }

    if ($code[$i] =~ /^(.*?)[\s]+(.*)/) {
        chomp $1;
        chomp $2;
        $opcode_extracted[$i] = uc($1);
        $non_opcode_extracted[$i] = $2;
        if (exists $opcode_16bit{$opcode_extracted[$i]}) {
            $opcode_generated[$j] = $opcode_16bit{$opcode_extracted[$i]};
            if ($opcode_extracted[$i] eq "BU"   || #1
                $opcode_extracted[$i] eq "BGEZ" || #2
                $opcode_extracted[$i] eq "BGZ"  || #3
                $opcode_extracted[$i] eq "BLEZ" || #4
                $opcode_extracted[$i] eq "BLZ"  || #5
                $opcode_extracted[$i] eq "BNZ"  || #6
                $opcode_extracted[$i] eq "BV"   || #7
                $opcode_extracted[$i] eq "BZ"   || #8
                $opcode_extracted[$i] eq "CALL") { #9

                $j++;
                if ($non_opcode_extracted[$i] =~ /^[\s]*(.*?)[\s]*$/i) {
                    push(@jmp_labels_called, $1);
                    push @{$jmp_out_label_extracted{$1}}, $j;




                    # print "Direct address $1 found; machine code= $opcode_generated[$i].$non_opcode_generated[$i] found in line ".$line_number[$i-1]."\n";
                }
                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in line ".$line_number[







        elsif (exists $opcode_7bit{$opcode_extracted[$i]}) {
            $opcode_generated[$j] = $opcode_7bit{$opcode_extracted[$i]};
            if ($non_opcode_extracted[$i] =~ /^0x([0-9A-F][0-9A-F])[\s]*,[\s]*([GL])[\s]*$/i) {
                $dir_addr = $1;
                $greater_lesser = $2;
                $mode = "0";

                if ($greater_lesser eq "G") {
                    $greater_lesser = "1";
                }
                else {
                    $greater_lesser = "0";
                }

                $non_opcode_generated[$j] = $greater_lesser.$mode.sprintf("%07b", hex($1));
                # print "Direct address $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
            }
            elsif ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*(.*?)[\s]*,[\s]*([GL])[\s]*$/i) {
                $incr_oper = $2;
                $ar = $1;
                $greater_lesser = $3;
                $mode = "1";

                if ($greater_lesser eq "G") {
                    $greater_lesser = "1";
                }
                else {
                    $greater_lesser = "0";
                }

                if ((exists $add_sub_ar{$incr_oper}) && ($ar <= 7)) {
                    $non_opcode_generated[$j] = $greater_lesser.$mode.$add_sub_ar{$incr_oper}."00".sprintf("%03b", $ar);
                }
                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in




            else {
                $error_count++;
                print "ERROR : Invalid Instruction $code[$i] in line " . $line_number[





        elsif (exists $opcode_8bit{$opcode_extracted[$i]}) {
            $opcode_generated[$j] = $opcode_8bit{$opcode_extracted[$i]};
            if ($opcode_extracted[$i] eq "LDPK" ||
                $opcode_extracted[$i] eq "LACK" ||
                $opcode_extracted[$i] eq "MPYK") {
                # Immediate instructions: 8-bit constant written as two hex digits
                if ($non_opcode_extracted[$i] =~ /^0x([0-9A-F][0-9A-F])$/i) {
                    $constant = $1;
                    $non_opcode_generated[$j] = sprintf("%08b", hex($1));
                    # print "constant $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                else {
                    $error_count++;
                    # Report the operand text itself; $constant is not set when the match fails
                    print "ERROR : Constant $non_opcode_extracted[$i] invalid $code[$i] in line " . $line_number[$i] . "\n";
                }
            }
            elsif ($opcode_extracted[$i] eq "LARP") {
                if ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*$/i) {
                    $constant = $1;
                    $non_opcode_generated[$j] = sprintf("%08b", hex($1));
                    # print "constant $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                else {
                    $error_count++;
                    print "ERROR : Constant $non_opcode_extracted[$i] invalid $code[$i] in line " . $line_number[$i] . "\n";
                }
            }
            elsif ($opcode_extracted[$i] eq "BANZ") {
                # Two-word instruction: the opcode word stays at $j-1, the branch address goes at $j
                $j++;
                if ($non_opcode_extracted[$i] =~ /^[\s]*(.*?)[\s]*,[\s]*ar(.*?)[\s]*,[\s]*(.*?)[\s]*$/i) {
                    push(@jmp_labels_called, $1);
                    push @{$jmp_out_label_extracted{$1}}, $j;
                    # print "Found $1 at $i array @{$jmp_out_label_extracted{$1}} \n";
                    $incr_oper = $3;
                    $ar = $2;
                    $mode = "1";
                    if ((exists $add_sub_ar{$incr_oper}) && ($ar <= 7)) {
                        $non_opcode_generated[$j-1] = $mode . $add_sub_ar{$incr_oper} . "00" . sprintf("%03b", $ar);
                        # print "Indirect address found; narp= ar$ar & operation= $add_sub_ar{$incr_oper} machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                    }
                    else {
                        $error_count++;
                        print "ERROR : Invalid Instruction $code[$i] in




                elsif ($non_opcode_extracted[$i] =~ /^[\s]*(.*?)[\s]*$/i) {
                    push(@jmp_labels_called, $1);
                    push @{$jmp_out_label_extracted{$1}}, $j;




                    $mode = "0";
                    $non_opcode_generated[$j-1] = $mode . sprintf("%07b", hex(0));
                    # print "Direct address $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in line " . $line_number[$i] . "\n";
                }
            }
            else {
                if ($non_opcode_extracted[$i] =~ /^0x([0-9A-F][0-9A-F])[\s]*$/i) {
                    $dir_addr = $1;
                    $mode = "0";
                    $non_opcode_generated[$j] = $mode . sprintf("%07b", hex($1));
                    # print "Direct address $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                elsif ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*(.*?)[\s]*$/i) {
                    $incr_oper = $2;
                    $ar = $1;
                    $mode = "1";
                    if ((exists $add_sub_ar{$incr_oper}) && ($ar <= 7)) {
                        $non_opcode_generated[$j] = $mode . $add_sub_ar{$incr_oper} . "00" . sprintf("%03b", $ar);
                        # print "Indirect address found; narp= ar$ar & operation= $add_sub_ar{$incr_oper} machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                    }
                    else {
                        $error_count++;
                        print "ERROR : Invalid Instruction $code[$i] in




                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in line " . $line_number[




        elsif (exists $opcode_5bit{$opcode_extracted[$i]}) {
            $opcode_generated[$j] = $opcode_5bit{$opcode_extracted[$i]};
            if ($opcode_extracted[$i] eq "LARK") {
                if ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*0x([0-9A-F][0-9A-F])[\s]*$/i) {
                    $ar = $1;
                    $constant = $2;
                    if ($ar <= 7) {
                        $non_opcode_generated[$j] = sprintf("%03b", $ar) . sprintf("%08b", hex($constant));
                        # print "constant $2 found for ar$1; machine code= $opcode_generated[$j].$non_opcode_generated[$j] $code[$i] in line ".$line_number[$i]."\n";
                    }
                    else {
                        $error_count++;
                        print "ERROR : Invalid Instruction $code[$i] in





            else {
                # $1 is the auxiliary register number captured by the match; it must be in 0..7
                if (($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*0x([0-9A-F][0-9A-F])[\s]*$/i) && ($1 <= 7)) {
                    my $ar_n = $1;
                    $dir_addr = $2;
                    $mode = "0";
                    $non_opcode_generated[$j] = sprintf("%03b", $ar_n) . $mode . sprintf("%07b", hex($2));
                    # print "Direct address $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                elsif ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*(.*?)[\s]*,[\s]*ar(.*?)[\s]*$/i) {
                    my $ar_n = $1;
                    $incr_oper = $2;
                    $ar = $3;
                    $mode = "1";
                    if ((exists $add_sub_ar{$incr_oper}) && ($ar <= 7) && ($ar_n <= 7)) {
                        $non_opcode_generated[$j] = $mode . sprintf("%03b", $ar_n) . $add_sub_ar{$incr_oper} . "00" . sprintf("%03b", $ar);
                        # print "Indirect address found; narp= ar$ar & operation= $add_sub_ar{$incr_oper} machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                    }
                    else {
                        $error_count++;
                        print "ERROR : Invalid Instruction $code[$i] in




                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in line " . $line_number[




        elsif (exists $opcode_3bit{$opcode_extracted[$i]}) {
            $opcode_generated[$j] = $opcode_3bit{$opcode_extracted[$i]};
            if ($non_opcode_extracted[$i] =~ /^0x([0-9A-F][0-9A-F])[\s]*,[\s]*(\d+)[\s]*$/i) {
                $dir_addr = $1;
                $mode = "0";
                $shift = $2;
                $non_opcode_generated[$j] = sprintf("%05b", $shift) . $mode . sprintf("%07b", hex($1));
                # print "Direct address $1 found; machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
            }
            elsif ($non_opcode_extracted[$i] =~ /^[\s]*ar(.*?)[\s]*,[\s]*(.*?)[\s]*,[\s]*(.*?)[\s]*$/i) {
                $incr_oper = $2;
                $shift = $3;
                $ar = $1;
                $mode = "1";
                if ((exists $add_sub_ar{$incr_oper}) && ($ar <= 7)) {
                    $non_opcode_generated[$j] = sprintf("%05b", $shift) . $mode . $add_sub_ar{$incr_oper} . "00" . sprintf("%03b", $ar);
                    # print "Indirect address found; narp= ar$ar & operation= $add_sub_ar{$incr_oper} machine code= $opcode_generated[$j].$non_opcode_generated[$j] found in line ".$line_number[$i-1]."\n";
                }
                else {
                    $error_count++;
                    print "ERROR : Invalid Instruction $code[$i] in line " . $line_number[$i] . "\n";
                }
            }
            else {
                $error_count++;
                print "ERROR : Invalid Instruction $code[$i]




        else {
            $error_count++;
            print "ERROR : Opcode doesn't exist $code[$i] in line " . $line_number[$i] . "\n";
        }
        # print $opcode_extracted[$i]."\t".$non_opcode_extracted[$i]."\t;".$opcode_generated[$j]."\t".$non_opcode_generated[$j]."\n";
        $j++;
    }
    elsif (exists $opcode_16bit{uc($code[$i])}) {
        $opcode_generated[$j] = $opcode_16bit{uc($code[$i])};
        $j++;
    }
    else {
        $error_count++;
        print "ERROR : Instruction $code[$i] in line " . $line_number[$i] . "




# Jump address generation
#################################

foreach my $k (@jmp_labels_called) {
    if (exists $jmp_in_label_extracted{$k}) {
        my @temp_arr = @{$jmp_out_label_extracted{$k}};
        # print "Label $k found in : $jmp_in_label_extracted{$k}\n";
        foreach my $jmp_addr (@temp_arr) {
            $jmp_addr_generated[$jmp_addr] = sprintf("%016b", $jmp_in_label_extracted{$k});






if ($error_count == 0) {
    ##########################################################################################
    open(my $out_file, '>', $mif_file) or die "\t Error : Output MIF file not found and can't be created !\n";
    # Data words are written below as hexadecimal digits, so DATA_RADIX is HEX
    print $out_file "WIDTH = 16;\nDEPTH = 65536;\n\nADDRESS_RADIX = DEC;    % Can be HEX, BIN or DEC %\nDATA_RADIX = HEX;    % Can be HEX, BIN or DEC %\n\n\nCONTENT BEGIN\n";

    for (my $i = 0; $i < $j; $i++) {
        # Concatenate the bit-string fields of word $i and convert them to an integer
        my $int = $opcode_generated[$i] . $non_opcode_generated[$i] . $jmp_addr_generated[$i];
        $int = unpack("N", pack("B32", substr("0" x 32 . $int, -32)));
        my $hex = sprintf("%04x", $int);
        print $out_file $i . " : " . $hex . ";%\t\t" . $code_rearr[$i] . $jmp_addr_generated_nonconverted[$i] . "%\n";
        # print $out_file $i." : ".$opcode_generated[$i].$non_opcode_generated[$i].$jmp_addr_generated[$i].";%\t\t".$code_rearr[$i].$jmp_addr_generated_nonconverted[$i]."%\n";
    }
    print $out_file "END;";

    close($out_file);
    print "\nASSEMBLY SUCCESSFUL : Successfully assembled $assembly_file.\n\tPlease check $mif_file for the output.\n";
    ##########################################################################################
    open(my $hex_file_out, '>', $hex_file) or die "\t Error : Output HEX file not found and can't be created !\n";

    for (my $i = 0; $i < $j; $i++) {
        my $int = $opcode_generated[$i] . $non_opcode_generated[$i] . $jmp_addr_generated[$i];
        $int = unpack("N", pack("B32", substr("0" x 32 . $int, -32)));
        my $hex = sprintf("%04x", $int);
        print $hex_file_out $hex . "\n";
    }

    close($hex_file_out);




else {
    print "\nASSEMBLY FAILED : Total of $error_count errors found. Please check your code\n";
}
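
The output stage of the assembler builds every 16-bit word as a string of '0' and '1' characters and converts it to hexadecimal with `pack`/`unpack`. The following is a minimal sketch of that conversion in isolation. The helper name `bits_to_hex` and the example field values are assumptions made for illustration; they are not taken from the thesis opcode tables.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the thesis listing): convert a string of
# '0'/'1' characters, at most 32 bits long, into a zero-padded hex word.
sub bits_to_hex {
    my ($bits) = @_;
    # Left-pad to exactly 32 bits, pack them MSB-first into 4 bytes, then
    # read those bytes back as one big-endian unsigned 32-bit integer.
    my $value = unpack("N", pack("B32", substr("0" x 32 . $bits, -32)));
    return sprintf("%04x", $value);
}

# Illustrative fields only: an 8-bit opcode pattern followed by an 8-bit constant.
my $opcode_bits   = "00000001";
my $constant_bits = sprintf("%08b", hex("03"));

print bits_to_hex($opcode_bits . $constant_bits), "\n";   # prints 0103
```

Padding on the left before packing keeps short bit strings right-aligned, so a word shorter than 16 bits still prints as four hex digits.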
I.3 Assembly source code for testing and median filter
I.3.1 Assembly code used for basic level testing
////////////////////////////////////////////////////////////////////////////////////////////





lack 0x01;          // acc= 1 arp=x ar0=0 ar1=0
sac 0x00, 0;        // acc= 1 arp=x ar0=0 ar1=0 mem[0]=1
lack 0x03;          // acc= 3 arp=x ar0=0 ar1=0 mem[0]=1
lark ar0, 0x00;     // acc= 3 arp=0 ar0=0 ar1=0 mem[0]=1
add *+, 0, ar0;     // acc= 4 arp=0 ar0=1 ar1=0 mem[0]=1
sac *-, 0, ar0;     // acc= 4 arp=0 ar0=0 ar1=0 mem[0]=1 mem[1]=4
sub *+, 0, ar1;     // acc= 3 arp=0 ar0=0 ar1=0 mem[0]=1 mem[1]=4
lark ar1, 0x10;     // acc= 3 arp=1 ar0=1 ar1=10 mem[0]=1 mem[1]=4
sac *, 0, ar0;      // acc= 3 arp=1 ar0=1 ar1=10 mem[0]=1, mem[1]=4, mem[10]=3
lt *, ar1;          // acc= 3 arp=0 ar0=1 ar1=10 mem[0]=1, mem[1]=4, mem[10]=3; Treg=4
mpy *, ar0;         // acc= 3 arp=1 ar0=1 ar1=10 mem[0]=1, mem[1]=4, mem[10]=3; Treg=4 Preg=C
mac *-, ar0;        // acc= f arp=0 ar0=1 ar1=10 mem[0]=1, mem[1]=4, mem[10]=3; Treg=4 Preg=f (acc=c+acc; Preg=4*mem[ar0])
mpyk 0x02;          // acc= f arp=0 ar0=1 ar1=10 mem[0]=1, mem[1]=4, mem[10]=3; Treg=4 Preg=8
pac;
L1: sub *, 0, ar0;
bnz L1;
bz L2;
L3: or *, ar1;
xor *, ar1;
ret;




I.3.2 Assembly code used for median filter algorithm
LAR AR0, 0x0003;
LAR AR1, 0x0004;
LAR AR2, 0x0005;
LAR AR3, 0x0006;
LAR AR4, 0x0007;
LAR AR5, 0x0008;
LAR AR7, 0x0009;
LARP AR7;
LAC 0x0000, 0;
SAC AR7, *-, 0;
LACK 0x01;
SAC AR7, *+, 0;
MAR AR7, *+;
LAC 0x0001, 0;
SAC AR7, *+, 0;
LACK 0x01;
SAC AR7, *-, 0;
MAR AR0, *-;




LAC AR0, *+, 0;
CMPSIMD AR0, *+, L;
PUSH;
CMPSIMD AR3, *-, G;
SAC AR0, *, 0;
POP;
CMPSIMD AR5, *, L;
SAC AR3, *-, 0;
POP;
CMPSIMD AR3, *-, G;
SAC AR3, *+, 0;
POP;
CMPSIMD AR3, *, L;
SAC AR1, *+, 0;




LAC AR1, *+, 0;
CMPSIMD AR1, *+, L;
PUSH;
CMPSIMD AR4, *-, G;
SAC AR1, *, 0;
POP;
CMPSIMD AR3, *, L;
SAC AR4, *, 0;
POP;
CMPSIMD AR4, *-, G;
SAC AR4, *+, 0;
POP;
CMPSIMD AR4, *, L;
SAC AR2, *+, 0;




LAC AR2, *+, 0;
CMPSIMD AR2, *+, L;
PUSH;
CMPSIMD AR4, *-, G;
SAC AR2, *, 0;
POP;
CMPSIMD AR5, *, L;
SAC AR4, *-, 0;
POP;
CMPSIMD AR5, *, G;
SAC AR4, *+, 0;
POP;
CMPSIMD AR4, *, L;
SAC AR3, *-, 0;
LAC AR5, *-, 0;
CMPSIMD AR5, *+, G;
CMPSIMD AR5, *-, G;
SAC AR3, *-, 0;
LAC AR4, *, 0;
CMPSIMD AR3, *, G;
PUSH;
LAC AR4, *-, 0;
CMPSIMD AR4, *+, L;
CMPSIMD AR4, *-, G;
SAC AR4, *, 0;
POP;
CMPSIMD AR4, *, L;
SAC AR3, *-, 0;
LAC AR4, *+, 0;
CMPSIMD AR5, *+, L;
CMPSIMD AR3, *+, L;
SAC AR3, *, 0;
LAC AR4, *, 0;
CMPSIMD AR3, *, G;
PUSH;
LAC AR4, *, 0;
CMPSIMD AR5, *, L;
CMPSIMD AR4, *, G;
SAC AR4, *, 0;
POP;
CMPSIMD AR6, *, L;
SAC AR7, *+, 0;
LAC AR7, *-, 0;
SUB AR7, *+, 0;
SAC AR0, *, 0;
BNZ L1;
MAR AR0, *+, 0;
MAR AR1, *+, 0;
MAR AR1, *+, 0;
MAR AR2, *+, 0;
MAR AR2, *+, 0;
MAR AR7, *+, 0;
LAR AR7, 0x0000;
MAR AR7, *+;
LAC AR7, *+, 0;
SUB AR7, *-, 0;
SAC AR0, *, 0;
BNZ L1;
L2: LACK 0x00;
SAC AR6, *+, 0;
BU L2;
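
Operands such as `AR7, *-, 0` in this listing follow the register, update operator, shift order that the 3-bit-opcode branch of the assembler in Section I.2 parses. As a rough illustration of how such an operand becomes a 13-bit field, the sketch below mirrors that branch. The helper name `encode_shift_operand` is hypothetical, and the 2-bit codes in `%add_sub_ar` are placeholders, because the real table contents and the list of mnemonics in the 3-bit class are not shown in this part of the listing. The range checks on the address and shift are an addition of the sketch.

```perl
use strict;
use warnings;

# ASSUMPTION: placeholder 2-bit codes; the thesis' actual %add_sub_ar table may differ.
my %add_sub_ar = ('*' => '00', '*+' => '01', '*-' => '10');

# Hypothetical helper: returns the 13-bit operand field as a bit string,
# or undef when the operand text is not recognised.
sub encode_shift_operand {
    my ($operand) = @_;
    if ($operand =~ /^0x([0-9A-F]{2})\s*,\s*(\d+)\s*$/i) {
        # Direct addressing: 5-bit shift, mode bit 0, 7-bit address.
        my ($addr, $shift) = (hex($1), $2);
        return undef if $addr > 0x7F || $shift > 31;
        return sprintf("%05b", $shift) . "0" . sprintf("%07b", $addr);
    }
    if ($operand =~ /^\s*ar(\d)\s*,\s*(\S+?)\s*,\s*(\d+)\s*$/i) {
        # Indirect addressing: 5-bit shift, mode bit 1, 2-bit update code,
        # two zero bits, 3-bit auxiliary register number.
        my ($ar, $oper, $shift) = ($1, $2, $3);
        return undef unless exists $add_sub_ar{$oper} && $ar <= 7 && $shift <= 31;
        return sprintf("%05b", $shift) . "1" . $add_sub_ar{$oper} . "00" . sprintf("%03b", $ar);
    }
    return undef;
}

print encode_shift_operand("AR7, *-, 0"), "\n";   # 0000011000111 with the placeholder codes
print encode_shift_operand("0x10, 3"), "\n";      # 0001100010000
```

Both forms produce 13 bits, which together with a 3-bit opcode fill one 16-bit instruction word.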
