Testing infrastructure and other considerations in a waferscale processor by Liu, Jingyang
© 2020 Jingyang Liu





Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the





With the breakdown of Dennard scaling, the improvement of computational
performance is exploited through parallel computing. Further stimulated by
emerging workloads, like machine learning, big data processing, and cloud
computing, which are either inherently parallel workloads or easily paralleliz-
able, processor core count continues to increase. Simply cramming more and
more processor cores into an ever larger chip is not feasible since the yield
of such a large chip can drop dramatically due to manufacturing defects.
Complicated large systems with multiple chips and/or packages connected
as nodes in a network can deliver massive computation performance but suf-
fer from high communication costs and large area, power, and cooling needed
for these systems. Therefore, novel architectures and integration technologies
are required to support the growing size of processors. Chiplet-based design is
a candidate that can realize large systems while promising high system yield.
One part of the system yield, namely the yield of individual dies, highly de-
pends on testing. After introducing the concepts of generic IC testing and
an example of testing infrastructure in an ARM-based system, this thesis
focuses on the testing infrastructure in a chiplet-based ARM-based wafer-
scale system. Some unique challenges and corresponding solutions in such
a waferscale system are addressed. Additionally, verification efforts during
this project are detailed, including RTL simulations and FPGA prototyping.
The thesis concludes with potential improvements to the author’s design as
well as future directions.
To my parents, for their love and support.
ACKNOWLEDGMENTS
First of all, I would like to express my sincere appreciation to my adviser,
Rakesh Kumar, for his support and guidance of my research. His encouragement
and patience motivated and helped me complete my contribution to this re-
search project and this thesis. This research project would not have been
possible without the collaboration of the NanoCAD group at the University
of California Los Angeles, especially Saptadeep Pal and Prof. Puneet Gupta.
I am extremely grateful to Saptadeep for his leadership and management of
this research project, as well as his inputs and guidance towards the con-
tents of this thesis. I would like to extend my thanks to Nicholas Cebry for
the opportunity that brought me to this research project and his partnership
throughout this project, as well as Matthew Tomei for his contribution to
this project.
I would also like to thank my colleagues and mentors during my academic
career. First, I am very grateful to the advisers in the ECE Office of Student
Affairs, the Department of Electrical and Computer Engineering in the Uni-
versity of Illinois Grainger College of Engineering for offering me a teaching
assistant position, and my fellow ECE 411 teaching assistants. I would also
like to thank the members of my research group, particularly Andrew Smith,
Nathaniel Bleier, Muhammad Husnain Mubarik, and Adam Auten, for their
knowledge and friendship.
Additionally, I would like to recognize my friends who support and help
me, especially Iris Chen, Ruicheng Xian, Wanxian Yang, and Stan Yang.
Finally, I would like to express my deepest gratitude to my family, especially
my parents, for their unconditional support.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 LARGE SYSTEM ARCHITECTURES
2.1 Chip Multiprocessors
2.2 Manycore Processors
2.3 Waferscale Systems
CHAPTER 3 IC TESTING
3.1 Design for Testability
3.2 JTAG Standard
3.3 Testing Infrastructure in an ARM-based System
CHAPTER 4 TESTING INFRASTRUCTURE FOR A WAFERSCALE SYSTEM
4.1 Overview of a Waferscale System
4.2 Testing Infrastructure for the Waferscale System
CHAPTER 5 OTHER CONSIDERATIONS AND CHALLENGES IN A WAFERSCALE SYSTEM
5.1 Accessing Uncore Components
5.2 Clock Distribution
5.3 Bootup Sequence
CHAPTER 6 VERIFYING THE TEST INFRASTRUCTURE
6.1 RTL Simulation
6.2 FPGA Prototype
CHAPTER 7 CONCLUSION
REFERENCES
LIST OF TABLES
3.1 Transition table of the TAP controller FSM
3.2 JTAG-DP registers
3.3 Debug port registers
3.4 MEM-AP registers
4.1 Memory map of a tile
5.1 Configuration registers
LIST OF FIGURES
3.1 TAP controller FSM
3.2 Block diagram of an ADI implementation
4.1 Block diagram of a tile
4.2 IEEE Std P1838 serial control mechanism
5.1 Timing diagram of a read transfer
5.2 Timing diagram of a write transfer
6.1 An example FPGA-MBED setup
LIST OF ABBREVIATIONS
ADI ARM Debug Interface
AHB Advanced High-performance Bus
AMBA Advanced Microcontroller Bus Architecture
AMD Advanced Micro Devices
BFS Breadth First Search
CAS Compare-And-Swap
CMP Chip Multi-Processor
DAP Debug Access Port
DfT Design for Testability
DRAM Dynamic Random-Access Memory
DUT Device-Under-Test
EDP Energy-Delay Product
ELF Executable and Linkable Format
FPGA Field-Programmable Gate Array
FSM Finite State Machine
GPU Graphics Processing Unit
HPC High-Performance Computing
IC Integrated Circuit
IEEE Institute of Electrical and Electronics Engineers
JTAG Joint Test Action Group
KGD Known Good Die
MOSFET Metal-Oxide-Semiconductor Field-Effect Transistor
PCB Printed Circuit Board
PISO Parallel-In-Serial-Out
PLL Phase-Locked Loop
RTL Register Transfer Level
SDF Standard Delay Format
Si-IF Silicon Interconnect Fabric
SRAM Static Random-Access Memory





Computer architects have always been pushing computing performance for-
ward no matter what the roadblocks may be. A single processor core’s design
went from simple multi-cycle design to complex multi-stage pipelined out-of-
order design in a few decades thanks to Moore’s law and Dennard scaling.
However, as Moore’s law and Dennard scaling come to an end, computer
architects have turned to parallel and heterogeneous computing to increase
performance rather than improving a single core’s design.
In the case of parallel computation, the number of cores of a processor has
kept increasing for the past several years as popular parallel applications
emerge, like machine learning tasks and graph applications. Increasing the
number of cores most likely requires more area on silicon. The idea behind a
waferscale processor is to populate all the area that a silicon wafer can offer
with as many cores as possible to build a single processor. However, due to
the inevitable contamination and defects in wafer fabrication, it is not
commercially viable to produce a monolithic waferscale processor. In general,
many separate but identical processors are made on a single wafer, and the
wafer is then diced into many dies, each containing a single processor. Due to
fabrication defects, some of the dies may not be operational and are therefore
discarded, which lowers the yield. In a monolithic waferscale processor, any
defect can render the entire wafer nonfunctional. To create
a waferscale processor, yield issues must be solved, and new architectures
must be developed to support a large number of cores in a processor.
Chiplet-based designs have been used to mitigate yield issues in building large
processors. In these designs, multiple smaller dies, or chiplets, are assembled
into a larger processor. The chiplets are fabricated through a traditional
process like a single die processor. Each chiplet can be tested for defects
before bonding to ensure functionality. This process enables larger processors
with high yield.
With the rising popularity of chiplet-based design and innovative integra-
tion technologies like die stacking, a well-designed test access infrastructure
has never been more important. Not only is the traditional test access
infrastructure within a single die essential before bonding, but the multi-die
structure must also be well architected to support test access for mid-bond,
post-bond, and post-package tests. The entire test access infrastructure
should provide modular test access from external I/Os to the dies of
the multi-die chip at every stage of the assembly process, namely individual
dies before assembly, a partially complete chip with several dies, and a fully
complete multi-die chip with and without a package. This thesis studies such a
testing infrastructure for a chiplet-based waferscale system with thousands of cores
distributed in hundreds of dies.
In Chapter 2, we present the trend of increasing core count due to the
breakdown of Dennard scaling. To sustain the growth in computational
performance and support popular parallel workloads, many architectures
supporting multiple cores were proposed, implemented, and commercialized.
Novel integration technologies and architectures that further
advance this trend, specifically chiplet-based designs, are then described.
In Chapter 3, we present the background and practices of generic IC testing.
To ensure a high yield for a chiplet-based system, testing of individual
dies is essential. The concept of design for testability and the details of an
industry standard for it are discussed to show how to test
a generic die. Then the details of how to test a more specific die, namely an
ARM-based system-on-chip, are presented in preparation for testing in an
ARM-based waferscale system.
In Chapter 4, we present a testing infrastructure suitable for a waferscale
system, continuing the discussion of Chapter 3. The target waferscale system
consists of many dies, with each die containing 14 ARM Cortex-M3 cores.
Honoring the physical constraints and testing requirements of the waferscale
system, the author developed a testing infrastructure by organizing existing
testing components for each core, designing custom logic, and writing an API
for an external debugger to access the testing infrastructure.
In Chapter 5, we present other challenges in the target waferscale system
and propose corresponding solutions. Combining all relevant considerations,
a bootup sequence controlled by a piece of software running on external
debuggers was proposed, which covers the time between the initial power-up
of the waferscale system and the start of program execution.
In Chapter 6, we show the verification efforts for the test infrastructure as
well as the whole system, mainly RTL simulation and FPGA prototyping.
These methods not only help in assessing the feasibility of some designs but
also ensure the functionality of various components as well as the whole
system as the design scales up.
The work presented in this thesis is part of a joint project involving the
author, students in the NanoCAD Lab at UCLA, and other students in the
Passat Research Group at UIUC. The author is the major developer of the
testing infrastructure in the target waferscale system, the software for an
external debugger to access said testing infrastructure, a set of architectural
registers used to access various uncore components, a clock forwarding
module, simulation scripts, and FPGA prototypes. Other students helped with
debugging the RTL, developed a network-on-chip enabling communication
between tiles using shared memory, wrote basic test programs that evaluate
the functionality of the network and high-level programs that test the wafer-
scale processor’s capability to handle meaningful workloads like a parallel
breadth-first search algorithm used in graph applications, and architected all
aspects of physical design including synthesis scripts, I/O pads, power, and




In this chapter, we discuss the motivation behind building large processors
and examples of the architectures of such large processors. Section 2.1 ex-
plains the reason why multi-core processors became necessary, discusses the
architectures of chip multiprocessors, and shows some early examples. Sec-
tion 2.2 introduces the evolution of chip multiprocessors to manycore pro-
cessors and discusses some prevalent architectures and the current trend of
cutting-edge architectures. Section 2.3 discusses the history and challenges
of waferscale processors, and how current and new technologies can inject
new life into this old idea.
2.1 Chip Multiprocessors
Computer performance has always been improved by advances in
both computer architecture and device and integration technology.
It has improved rapidly and drastically since the first semiconductor
integrated circuit (IC) was demonstrated in 1958 [1]. Only a few years
later, in 1965, Moore’s law predicted that the number of transistors in an
integrated circuit would double every year, although in 1975 the prediction
was revised to doubling every two years based on industry observation of the
shrinking size of transistors and spacing [2], [3]. Dennard scaling was
introduced in 1974, describing the performance expectation of smaller devices.
The scaling principle stated that as MOSFETs reduce in size, the switching
speed would inherently be faster and the power density of integrated circuits
would remain constant [4]. This means that smaller devices result in higher
frequencies of integrated circuits while the cooling problem stays the same.
Coupled with Moore’s law on the rate at which the size of a transistor
shrinks, performance per watt of an integrated circuit doubles about every 18
months.
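The constant power density claim can be followed with a short scaling argument. The sketch below assumes the classic constant-field scaling model, in which linear dimensions and supply voltage shrink by a factor κ > 1; the symbols κ, C, V, f, P, and A (capacitance, voltage, frequency, dynamic power, and area) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
P  &= C V^{2} f \\
C' &= \frac{C}{\kappa}, \qquad V' = \frac{V}{\kappa}, \qquad f' = \kappa f \\
P' &= \frac{C}{\kappa} \cdot \frac{V^{2}}{\kappa^{2}} \cdot \kappa f = \frac{P}{\kappa^{2}} \\
A' &= \frac{A}{\kappa^{2}} \quad\Longrightarrow\quad \frac{P'}{A'} = \frac{P}{A}
\end{align*}
```

Under these assumptions, power per device falls at the same rate as device area, so power density is unchanged while frequency rises by κ.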
With Moore’s law and Dennard scaling describing the trend of semiconductor
fabrication reasonably accurately, ever more transistors became available
to computer architects aiming to improve the performance of a processor
core. These additional transistors allowed architecture advancements like
deep pipelining, speculation, superscalar design, and out-of-order execution
while running the chips at higher frequencies. However, after 30 years of
scaling, due to leakage constraints, threshold voltages and operating voltages
became difficult to reduce [5]. Stagnation of operating voltage reduction
means operating frequency can no longer increase without increasing power
density. This ultimately caused the breakdown of Dennard scaling and slowed
down performance improvements.
Computer architects have been improving performance by exploiting var-
ious levels of parallelism. The parallelism between individual instructions,
or instruction-level parallelism, was exploited in the advancement of a single
core’s microarchitecture, for example, superscalar, out-of-order, and
speculative execution. Higher levels of parallelism, namely thread-level
parallelism and process-level parallelism, encourage parallel computer
architectures like chip multiprocessors (CMPs) and parallel applications [6].
The Piranha system is a research prototype that aggressively exploits chip
multiprocessing by integrating multiple simple cores into a single chip [7].
IBM introduced the first commercial multi-core processor chip with two 64-
bit microprocessors, the POWER4 [8]. Chip multiprocessors can now be
seen across computation domains ranging from portable consumer devices
like phones, tablets, and laptops, to data centers and supercomputers.
2.2 Manycore Processors
Increasing the number of processor cores in a system offers benefits like higher
computational throughput of single-threaded applications and higher perfor-
mance of scalable multi-threaded applications often seen in high-performance
computing (HPC). To build such systems with many cores, several architec-
tures were developed in the past few decades.
A computer cluster is arguably the most common way to achieve a massive
number of processor cores in a system. Such architecture is seen in servers,
data centers, and supercomputers. Many computers are connected as nodes
in a network with network cables and switches with software managing com-
pute load of the nodes and communication between nodes. One of the first
commercially successful clusters is the VAXcluster [9]. Cray Inc. uses similar
ideas to build supercomputers like Blue Waters at UIUC [10]. Blue Waters
consists of tens of thousands of compute nodes, connected in a 3D torus
network, and offers massive compute throughput, memory, and storage.
To address the massive area and expensive inter-node communication of
clusters, Intel’s single-chip cloud computer (SCC) integrates 48 cores in a
2D-mesh network, with inter-core communication done through message passing
using on-die shared memory [11]. The Intel many integrated core architecture
(MIC) offers 32 cores, each of which can execute 4 hardware threads with
round-robin scheduling, in the form of a PCIe accelerator card, and supports
traditional multi-threaded programming models.
A similar inter-die communication problem is also present in an inherently
parallel processor class, GPUs. Multi-GPU was once the easy way to increase
the number of cores in a GPU system since the size of the GPU cannot
grow larger without sacrificing cost-effectiveness, similar to the situation of
larger CPUs. Nvidia proposed a multi-chip-module GPU (MCM-GPU) that
integrates multiple GPU modules into a single package [12].
2.3 Waferscale Systems
Building large single-chip processors can offer much lower communication
energy compared to connected computers like clusters. The physical distance
between cores in a single-chip large processor is a few orders of magnitude
shorter than that in a multi-chip computer or even different chips in different
computers. This idea leads to waferscale processors.
Waferscale processors are processors with the size of a wafer, which is much
larger than traditional diced up dies. The first major attempt at waferscale
integration was done in 1980 by Gene Amdahl’s Trilogy Systems [13]. However,
the defects that were inevitable during the fabrication processes and
the redundancy needed to compensate for such defects made the low-yielding
waferscale system unattractive [13].
Advances in manufacturing and integration technologies enable new
architectures to pursue large systems, such as chiplet-based designs and 2.5D
and 3D integration. These technologies aim to reduce the communication
bottleneck between processor and processor, processor and memory, and
memory and memory.
3D-stacked memory architectures can dramatically decrease DRAM access
latency and increase bandwidth [14], [15], [16]. Other 3D integration archi-
tectures based on through silicon vias (TSVs) and monolithic 3D ICs are also
commonly studied [17].
2.5D integration with interposer-based designs has received much wider
adoption in the industry. Chip-on-wafer-on-substrate (CoWoS), an
interposer-based integration technology developed by TSMC, can connect multiple
chips using micro-bumps and TSVs on a large silicon interposer [18], [19]. The
embedded multi-die interconnect bridge (EMIB) connects chips with very thin
silicon bridges embedded within the package substrate [20]. The wafer scale
engine (WSE) from Cerebras Systems includes enough redundant cores to
account for defect-induced failures, resulting in a large and functional chip
even with one percent of the cores failing [21].
Another novel integration technology, silicon interconnect fabric (Si-IF),
uses copper pillars instead of micro-bumps and TSVs in traditional interposer
technologies like CoWoS from TSMC [22], [23]. Due to the aspect ratio
limitation of TSVs, interposers need to be thinned, making them fragile and
size limited, while Si-IF is a rigid interconnect that can scale up to large
sizes without mechanical support from packages [23]. A GPU case study
with an Si-IF-based waferscale processor shows up to 18.9x speedup and 143x




In an Si-IF-based waferscale system, there are three components contributing
to the system yield: yield of the die, yield of the copper pillar bonds and yield
of the Si-IF substrate [24]. Previous work observed copper pillar yield higher
than 99% [23], calculated high yield of the Si-IF substrate [24] and 100%
electrical connectivity for an Si-IF-based system with interconnected dies
[24]. By identifying known-good dies (KGD), high yield of the pre-assembly
die can be ensured. KGD testing mainly consists of electrical testing and
functional testing. The rest of the thesis will focus on functional testing.
This chapter first covers generic IC testing, with section 3.1 introducing
the concept of design for testability (DfT) and section 3.2 detailing an
industry standard for DfT. Section 3.3 moves on to testing in a specific
architecture, namely an ARM-based system, which is the same base processor
architecture used in the waferscale system with which the later part of this
thesis is concerned. The next chapter will detail the test infrastructure in an
ARM-based waferscale system that is built on the fundamentals introduced in this
chapter.
3.1 Design for Testability
KGD testing techniques [25], [26] can be used to pre-select the dies to be used
for assembly on the targeted chiplet-based system [24]. These techniques have
helped to combat the difficulty of bare die probing due to the small sizes of
the I/O pads.
In addition to electrical techniques briefly mentioned in the last paragraph,
functional validation is essential and usually facilitated by structures built
into the hardware design. One of the techniques is called design for testability
(DfT).
Testability consists of controllability and observability [27]. In digital
circuits, controllability means a certain node can be set to a certain value
(low or high). Observability means the state of a certain node can be observed.
The most common DfT structure is the scan chain. Scan chains are
usually parallel-in-serial-out (PISO) shift registers that can store the states
of some internal nodes and can be set to change the state of some internal
nodes. With such DfT structures, test patterns of
input stimuli can be applied to the device under test (DUT), and the response
from the DUT can be checked to validate functionality. JTAG, or IEEE Std
1149.1, is a standard for design for testability.
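The capture-then-shift behavior of a scan chain can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class `ScanChain` and its methods are hypothetical names introduced here, the chain is modeled as a plain list of bits, and the last cell is assumed to be the serial output.

```python
class ScanChain:
    """Toy scan chain: a shift register attached to internal nodes."""

    def __init__(self, length):
        self.cells = [0] * length

    def capture(self, node_values):
        """Parallel load: sample the internal nodes (observability)."""
        if len(node_values) != len(self.cells):
            raise ValueError("one node value per scan cell is required")
        self.cells = list(node_values)

    def shift(self, bit_in):
        """Shift by one position and return the bit leaving the chain."""
        bit_out = self.cells[-1]
        self.cells = [bit_in] + self.cells[:-1]
        return bit_out

    def shift_pattern(self, pattern):
        """Shift a whole pattern in and collect the bits shifted out."""
        return [self.shift(bit) for bit in pattern]


chain = ScanChain(4)
chain.capture([1, 0, 1, 1])                    # observe four internal nodes
response = chain.shift_pattern([1, 1, 0, 0])   # read them out, load a stimulus
```

After the four shifts, `response` holds the captured node values in the order they left the chain, and `chain.cells` holds the new stimulus, which is the controllability half of the same mechanism.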
3.2 JTAG Standard
JTAG communication protocol, or IEEE Standard 1149.1 [28], is a serial
communication protocol that uses a minimum of a four-wire interface. The
minimal four signals are TDI (Test Data In), TDO (Test Data Out), TCK
(Test Clock), and TMS (Test Mode Select). TMS is used to manipulate the
state of a TAP (Test Access Port) controller, which essentially selects which
shift register the test stimulus will use with TDI. The TAP controller FSM
is a 16-state Moore machine with TMS and TCK as inputs. Transitions
between states occur at the rising edges of TCK. The TAP controller FSM
is detailed in Table 3.1 and Figure 3.1.
In the following paragraphs, typical operations of JTAG will be described,
including manipulating the TAP controller states and shifting the registers.
“JTAG software” refers to the piece of code that will interface with the four-
wire interface.
Since TCK is driven by the testing harness but acts as the clock signal
for the JTAG hardware, “toggle TCK to produce X rising edges of TCK” will be
shortened to “X TCK cycles” for the rest of the discussion.
One notable feature of the TAP FSM is that from any state, if TMS = 1 is held
for 5 TCK cycles, the controller will end up in Test Logic Reset. This makes it easy
to reset the TAP controller state without the optional TRST (Test Reset)
signal.
In the TAP controller state names, IR stands for Instruction Register and
DR stands for Data Register. Generally, there is only one instruction register
and multiple data registers. All registers are shift registers since this is a
serial communication protocol. Based on the value inside the IR, a different
data register will be selected.

Table 3.1: Transition table of the TAP controller FSM
Current state      Next state (TMS = 0)   Next state (TMS = 1)
Test Logic Reset   Run Test Idle          Test Logic Reset
Run Test Idle      Run Test Idle          Select DR Scan
Select DR Scan     Capture DR             Select IR Scan
Capture DR         Shift DR               Exit 1 DR
Shift DR           Shift DR               Exit 1 DR
Exit 1 DR          Pause DR               Update DR
Pause DR           Pause DR               Exit 2 DR
Exit 2 DR          Shift DR               Update DR
Update DR          Run Test Idle          Select DR Scan
Select IR Scan     Capture IR             Test Logic Reset
Capture IR         Shift IR               Exit 1 IR
Shift IR           Shift IR               Exit 1 IR
Exit 1 IR          Pause IR               Update IR
Pause IR           Pause IR               Exit 2 IR
Exit 2 IR          Shift IR               Update IR
Update IR          Run Test Idle          Select DR Scan
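The transition table of the TAP controller maps directly onto a lookup structure, which also makes the five-cycle reset property easy to check. The following is a minimal sketch; the names `TAP_FSM`, `step`, and `walk` are illustrative and are not taken from the author's software.

```python
# Table 3.1 as a lookup: state -> (next state if TMS = 0, next state if TMS = 1)
TAP_FSM = {
    "Test Logic Reset": ("Run Test Idle", "Test Logic Reset"),
    "Run Test Idle":    ("Run Test Idle", "Select DR Scan"),
    "Select DR Scan":   ("Capture DR", "Select IR Scan"),
    "Capture DR":       ("Shift DR", "Exit 1 DR"),
    "Shift DR":         ("Shift DR", "Exit 1 DR"),
    "Exit 1 DR":        ("Pause DR", "Update DR"),
    "Pause DR":         ("Pause DR", "Exit 2 DR"),
    "Exit 2 DR":        ("Shift DR", "Update DR"),
    "Update DR":        ("Run Test Idle", "Select DR Scan"),
    "Select IR Scan":   ("Capture IR", "Test Logic Reset"),
    "Capture IR":       ("Shift IR", "Exit 1 IR"),
    "Shift IR":         ("Shift IR", "Exit 1 IR"),
    "Exit 1 IR":        ("Pause IR", "Update IR"),
    "Pause IR":         ("Pause IR", "Exit 2 IR"),
    "Exit 2 IR":        ("Shift IR", "Update IR"),
    "Update IR":        ("Run Test Idle", "Select DR Scan"),
}


def step(state, tms):
    """Advance the TAP controller by one TCK cycle with the given TMS value."""
    return TAP_FSM[state][tms]


def walk(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK cycle."""
    for tms in tms_bits:
        state = step(state, tms)
    return state


# From every state, five TCK cycles with TMS = 1 end in Test Logic Reset.
reset_works = all(walk(s, [1] * 5) == "Test Logic Reset" for s in TAP_FSM)
```

Four cycles are not always enough: starting from Shift DR, four cycles of TMS = 1 only reach Select IR Scan, which is why five is the number to hold.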
The basic hardware organization for a JTAG protocol consists of a TAP
controller, which has TMS and TCK as inputs, and several registers, in-
cluding one instruction register and several data registers, and multiplexers
controlled by the TAP controller to select which register is connected to
TDO. There could be de-multiplexers, controlled by the TAP controller to
select which register is connected to TDI, but the functionality of the proto-
col is identical to an implementation where the TDI is connected to all the
registers. In the Shift IR and Shift DR states, TDI is connected to the input
of the IR or a DR respectively, and TDO is connected to the output of that
register, which means for every bit shifted in through TDI, TDO will output
a bit that resided in the register. The widths of the data registers can
differ based on devices and implementation, but the instruction register
is typically 4 bits wide.
Figure 3.1: TAP controller FSM
A typical JTAG operation uses the important states in the following order
(transitional states are not listed):
1, Reset; 2, Shift IR; 3, Update IR; 4, Run Test Idle; 5, Shift DR; 6,
Update DR; 7, Run Test Idle
If needed, steps 2 through 7 can be repeated for different instructions.
Resetting the TAP FSM can be done with the TRST signal when present
or by holding TMS = 1 for 5 TCK cycles. After resetting the TAP FSM, it
will be in the Test Logic Reset state, then Run Test Idle state can be safely
reached by holding TMS = 0 for 1 TCK cycle.
Next, an instruction is ready to be shifted into the IR. From Run Test
Idle state, simply hold TMS = 1 for 2 TCK cycles then hold TMS = 0 for 2
TCK cycles, and Shift IR state will be reached. To shift an instruction, hold
TMS = 0 with TDI as bits in the instruction and toggle TCK to have 1 cycle
per instruction bit. After the bits of an instruction are shifted into the IR,
to commit the changes to IR, simply hold TMS = 1 for 2 TCK cycles, and
the Update IR state will be reached. At this point, the IR will update the
connections of TDI and TDO so that they will be the input and output of a
certain data register, which means that a specific data register, determined
by the instruction used, is ready for use. Assuming the TAP controller is still
in the Update IR state, to reach the Shift DR state, there are two options.
The first is to go straight from Update IR to Select DR Scan by setting TMS
= 1 for 1 TCK cycle, then transition to Shift DR state after 2 TCK cycles
of TMS = 0. The second is to go to the Run Test Idle state by setting TMS
= 0 for 1 TCK cycle, go to Select DR Scan state by setting TMS = 1 for 1
TCK cycle, and then transition to Shift DR state by holding TMS = 0 for 2
TCK cycles. The first option will be faster since it takes 3 TCK cycles rather
than the 4 cycles taken by the second. However, in software implementation,
a transition to Run Test Idle state is a more general and useful operation,
which means that the second option can potentially have more code reuse in
JTAG software.
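The TMS sequences described above can be checked against the TAP FSM in a few lines. This is a sketch only: `build_tap_table` and `walk` are hypothetical helpers, and the table is generated from the symmetry between the DR and IR columns of the FSM rather than written out in full.

```python
def build_tap_table():
    """Return state -> (next state if TMS = 0, next state if TMS = 1)."""
    table = {
        "Test Logic Reset": ("Run Test Idle", "Test Logic Reset"),
        "Run Test Idle": ("Run Test Idle", "Select DR Scan"),
    }
    # The DR and IR columns differ only in where "Select" goes when TMS = 1.
    for reg, select_on_1 in (("DR", "Select IR Scan"), ("IR", "Test Logic Reset")):
        table[f"Select {reg} Scan"] = (f"Capture {reg}", select_on_1)
        table[f"Capture {reg}"] = (f"Shift {reg}", f"Exit 1 {reg}")
        table[f"Shift {reg}"] = (f"Shift {reg}", f"Exit 1 {reg}")
        table[f"Exit 1 {reg}"] = (f"Pause {reg}", f"Update {reg}")
        table[f"Pause {reg}"] = (f"Pause {reg}", f"Exit 2 {reg}")
        table[f"Exit 2 {reg}"] = (f"Shift {reg}", f"Update {reg}")
        table[f"Update {reg}"] = ("Run Test Idle", "Select DR Scan")
    return table


TAP = build_tap_table()


def walk(state, tms_bits):
    """Apply one TMS value per TCK cycle and return the final state."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state


# Run Test Idle -> Shift IR: TMS = 1, 1, 0, 0
to_shift_ir = walk("Run Test Idle", [1, 1, 0, 0])
# Shift IR -> Update IR: TMS = 1, 1
to_update_ir = walk("Shift IR", [1, 1])
# Option 1: Update IR -> Shift DR directly (3 TCK cycles)
option_1 = walk("Update IR", [1, 0, 0])
# Option 2: Update IR -> Run Test Idle -> Shift DR (4 TCK cycles)
option_2 = walk("Update IR", [0, 1, 0, 0])
```

Both options end in Shift DR. The second costs one more TCK cycle but passes through Run Test Idle, so a single "go to Run Test Idle" routine followed by a "Run Test Idle to Shift DR" routine can serve every access.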
Now that the Shift DR state is reached, several different situations will be
presented based on the functionality of the selected DR. Typically, a data
register can be used for read-out only, for write-in only, or a combination of
both. If the data register is used for read-out only, meaning the bits in the
DR are the value for read-out, then the TDI can be either 1 or 0 since it will
not affect the underlying devices that are being accessed. In this case, it is
crucial to record TDO every TCK cycle in the JTAG software to have the
proper read-out value. If the data register is for write-in only, meaning the
bits in the DR are meaningless while the bits to be shifted in are meaningful,
then TDI and TCK should be toggled while holding TMS = 0 to have the
correct data bits shifted into the DR selected. If the data register is used for
both read-out and write-in, then the software needs to combine both correct
shift-in through TDI and correct recording of TDO in the meantime. After
shifting the correct number of bits in the Shift DR state, the DR changes
can be committed by moving the TAP controller state to Update DR, which
can be done by holding TMS = 1 for 2 TCK cycles.
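The three cases (read-out only, write-in only, and combined) are all the same shift operation seen from different sides, which a small model makes concrete. The function `shift_dr` below is a hypothetical helper; it assumes the least significant bit of the register sits at the TDO end, so bits are exchanged LSB first, and it models only the register contents, not the TMS or TCK signalling.

```python
def shift_dr(dr_value, width, tdi_value):
    """Shift `width` bits through a data register, LSB first.

    Returns (tdo_value, new_dr_value): the bits recorded on TDO and the
    register contents once all bits from TDI have been shifted in.
    """
    mask = (1 << width) - 1
    dr = dr_value & mask
    tdo_value = 0
    for i in range(width):
        tdo_bit = dr & 1                    # bit currently at the TDO end
        tdi_bit = (tdi_value >> i) & 1      # next bit presented on TDI
        dr = (dr >> 1) | (tdi_bit << (width - 1))
        tdo_value |= tdo_bit << i           # the JTAG software records TDO
    return tdo_value, dr


# Combined access: read the old contents while writing new ones.
old, new = shift_dr(0xDEADBEEF, 32, 0x12345678)

# Read-out only: the value on TDI does not change what is read.
read_a, _ = shift_dr(0xA5, 8, 0x00)
read_b, _ = shift_dr(0xA5, 8, 0xFF)
```

For a read-out-only register the second return value is simply discarded, and for a write-in-only register the first one is.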
After Update DR, if the JTAG access is finished, then it is ideal to move
the TAP controller to Run Test Idle state by setting TMS = 0 for 1 TCK
cycle. Otherwise, the shift IR then shift DR sequence can be repeated to have
multiple data registers updated through different instructions each time.
3.3 Testing Infrastructure in an ARM-based System
Both the concept of DfT and the details on JTAG can be used in generic digi-
tal IC testing. This section, however, discusses how those pieces of knowledge
can apply to a specific architecture, namely an ARM-based system. The tar-
get system contains an ARM core and some other components connected to
the core. Subsection 3.3.1 will introduce the Arm Debug Interface (ADI), a test
access protocol which is based on the JTAG standard, and the Debug Access Port
(DAP), an implementation of the ADI. Subsection 3.3.2 will cover a piece of software,
which follows the ADI specification in order to access the DAP and therefore
access the ARM-based system.
3.3.1 Arm Debug Interface and Debug Access Port
The Arm Debug Interface (ADI) provides access to debug functionality that
is provided by debug components in an embedded System-on-Chip (SoC) [29].
There are two major types of debug functionality: embedded core debug func-
tionality and system debug functionality. An embedded microprocessor can
provide debug features to enable the debugging of applications, like facilities
that enable an external host to modify and assess the state of the processor
by providing access to the internal registers and the memory system. An
SoC can provide system-level debug features by enabling an external host to
access components outside the microprocessor cores and interconnects of the
system.
An implementation of the ADI is called a Debug Access Port (DAP). A
DAP provides a debugger with a standard interface to access debug resources
in systems [29]. Figure 3.2 shows how the DAP is connected between a
debugger and the system to be debugged.
The DAP consists of a debug port (DP), multiple access ports (APs), and
resource-specific transports that connect the DP to the APs and the APs to
the appropriate debug resources. The DP consists of a physical connection to
the debugger, including the JTAG debug port (JTAG-DP), and multiple DP
registers. An AP uses a resource-specific transport to access a debug resource,
which can be the debug registers of a microprocessor core, a memory system,
etc.
Figure 3.2: Block diagram of an ADI implementation
Since the JTAG standard is chosen and all the resources that need to be
accessed are memory-mapped, only the JTAG-DP and MEM-AP need to be
discussed among the various functionality provided by the DAP.
The JTAG-DP is based on the JTAG standard [30]. The physical connec-
tion from a debugger to the JTAG-DP uses the standard minimal four-wire
interface and the optional TRST. The JTAG-DP consists of a TAP state
machine, an instruction register (IR), and various data registers (DR). Ta-
ble 3.2 lists the DR registers, their corresponding IR values, widths, and
descriptions.
Table 3.2: JTAG-DP registers
IR value Name Width Description
b1010 DPACC 35 JTAG-DP Debug Port (DP) Access Register
b1011 APACC 35 JTAG-DP Access Port (AP) Access Register
b1110 IDCODE 32 JTAG-DP Device ID Code Register
b1111 BYPASS 1 JTAG-DP Bypass Register
With BYPASS instruction, the corresponding register is only 1-bit wide.
This is essential in a multi-device scenario, which will be covered in section
4.2.2.
With IDCODE instruction, the data register will contain a 32-bit device ID
code that contains information on version code, part number, and designer
ID. Shift-in data is ignored, making this essentially a read-only register. As
far as this project is concerned, the shift-out 32-bit device ID code is a fixed
non-uniform magic number only used for sanity checks.
Notice that DPACC and APACC have the same width of 35; in fact, they
have the same format and operations as well. Before shifting, the 35-bit
shift register consists of a 32-bit read-out, which resides in bit position 34
to 3, then a 3-bit ACK, which resides in bit position 2 to 0. The 3-bit
ACK indicates the status of the previous transaction. Only two values are
implemented, namely b010, which indicates OK/FAULT, and b001, which
indicates WAIT. OK/FAULT means the previous transaction was completed,
either successfully or with a fault.
Following OK/FAULT, the 35-bit register takes the following format: 32-bit
input data in bit location 34 to 3, 2-bit DP or AP address in bit location 2
to 1, then an R/W bit in bit location 0. The R/W bit indicates whether the
request is a read operation or a write operation. If the bit is 0, meaning write,
the 32-bit input data will be requested to write into the destination register.
If the bit is 1, meaning read, the 32-bit input data will be ignored and a
request for reading the destination will be generated; the result will appear
in the 32-bit read-out in bit location 34 to 3 when the register is accessed again.
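The two 35-bit layouts can be written out as bit operations. This is a minimal sketch under the field positions stated above; the helper names packRequest, unpackResponse, and ScanResult are hypothetical and do not appear in the software described in this thesis.

```cpp
#include <cstdint>

constexpr std::uint8_t ACK_OK_FAULT = 0x2;  // b010
constexpr std::uint8_t ACK_WAIT     = 0x1;  // b001

// Value shifted in: data in bits [34:3], DP/AP address in bits [2:1],
// R/W in bit [0] (1 = read, 0 = write).
inline std::uint64_t packRequest(std::uint32_t data, std::uint8_t addr, bool read) {
    return (static_cast<std::uint64_t>(data) << 3) |
           (static_cast<std::uint64_t>(addr & 0x3u) << 1) |
           (read ? 1u : 0u);
}

// Value shifted out: read-out in bits [34:3], ACK in bits [2:0].
struct ScanResult {
    std::uint32_t readout;
    std::uint8_t  ack;
};

inline ScanResult unpackResponse(std::uint64_t raw) {
    ScanResult r;
    r.readout = static_cast<std::uint32_t>((raw >> 3) & 0xFFFFFFFFu);
    r.ack     = static_cast<std::uint8_t>(raw & 0x7u);
    return r;
}
```

A 64-bit integer is used only because 35 bits do not fit in a 32-bit word; on the wire the bits are shifted serially.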
Through DPACC, a debugger can access various registers inside the DAP.
These DP registers are shown in Table 3.3. They are all 32-bit registers.






The CTRL/STAT register is used to control and obtain status information
about the DP. The details of the CTRL/STAT register can be found on page
B2-56 of [29]. Only bits [31:28] are used in this project. These four bits,
in the order of bit [31] to bit [28] are system powerup acknowledgment, sys-
tem powerup request, debug powerup acknowledgment, and debug powerup
request. The two powerup request bits are used to initiate powerup of the
system and debug domains when asserted, or request removal of power when
de-asserted. The two powerup acknowledgment bits default to low states and
only go high on receipt of the corresponding request.
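The four power-up bits can be expressed as masks. The sketch below assumes the field names used in the ADI specification for bits [31:28] of CTRL/STAT; the helper functions powerUpRequest and poweredUp are hypothetical.

```cpp
#include <cstdint>

// CTRL/STAT bits [31:28]; names follow the ADI specification (assumption).
constexpr std::uint32_t CSYSPWRUPACK = 1u << 31;  // system powerup acknowledgment
constexpr std::uint32_t CSYSPWRUPREQ = 1u << 30;  // system powerup request
constexpr std::uint32_t CDBGPWRUPACK = 1u << 29;  // debug powerup acknowledgment
constexpr std::uint32_t CDBGPWRUPREQ = 1u << 28;  // debug powerup request

// Value to write to CTRL/STAT to request both power domains.
constexpr std::uint32_t powerUpRequest() {
    return CSYSPWRUPREQ | CDBGPWRUPREQ;
}

// True once both acknowledgment bits read back high.
constexpr bool poweredUp(std::uint32_t ctrlStat) {
    return (ctrlStat & (CSYSPWRUPACK | CDBGPWRUPACK)) ==
           (CSYSPWRUPACK | CDBGPWRUPACK);
}
```

A debugger would typically write powerUpRequest() through DPACC and then poll CTRL/STAT until poweredUp() holds.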
The RDBUFF register is a 32-bit read-only buffer. In this project, it is
used to buffer the read result from either a DPACC or APACC read request.
The SELECT register is used to select an AP and the active register bank
within the selected AP. Bits [31:24], APSEL, select the AP. Bits [23:8] are
reserved. Bits [7:4], APBANKSEL, select the active four-word register bank
on the selected AP. In this project’s case, an APSEL value of 0x00 selects the
memory access port (MEM-AP). The details of all the MEM-AP registers can be found on page
C2-151 of [29]. The registers that are used in this project are shown in Table
3.4. They are all 32-bit registers.
Table 3.4: MEM-AP registers
Bank Address Name Accessibility
0x0 b00 Control/Status Word (CSW) RW
0x0 b01 Transfer Address (TAR) RW
0x0 b11 Data Read/Write (DRW) RW
0xF b11 Identification Register (IDR) R
These MEM-AP registers can be accessed through APACC. In Table 3.4,
the bank is selected by APBANKSEL in DP SELECT register, while the
address is from the address field of bits [2:1] of APACC.
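The two-step addressing (bank through SELECT, then a 2-bit address in APACC) can be sketched as follows. The struct ApReg, the helper selectValue, and the AP_* constants are illustrative names, chosen to mirror Table 3.4 rather than taken from the project source.

```cpp
#include <cstdint>

// Compose a DP SELECT value: APSEL in bits [31:24], APBANKSEL in bits [7:4].
constexpr std::uint32_t selectValue(std::uint8_t apsel, std::uint8_t bank) {
    return (static_cast<std::uint32_t>(apsel) << 24) |
           (static_cast<std::uint32_t>(bank & 0xFu) << 4);
}

// A MEM-AP register is identified by its bank and its 2-bit APACC address.
struct ApReg {
    std::uint8_t bank;
    std::uint8_t addr;
};

constexpr ApReg AP_CSW{0x0, 0x0};  // Control/Status Word
constexpr ApReg AP_TAR{0x0, 0x1};  // Transfer Address
constexpr ApReg AP_DRW{0x0, 0x3};  // Data Read/Write
constexpr ApReg AP_IDR{0xF, 0x3};  // Identification Register
```

For example, reading IDR requires SELECT to hold selectValue(0x00, AP_IDR.bank) before an APACC read with address b11 is issued.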
The control/status word (CSW) register configures and controls accesses
through MEM-AP to or from a connected memory system. The details of
all the fields can be found on page C2-178 of [29]. Some notable ones are
AddrInc at bits [5:4] and Size at bits [2:0]. The AddrInc bits control the
auto-increment of address in the transfer address register (TAR), which will
be discussed in the next paragraph. The Size bits configure the data type,
for example, 16-bit halfword, 32-bit word, etc., of the memory access to be
initiated.
The transfer address register (TAR) holds the memory address to be ac-
cessed through MEM-AP accesses. This is a 32-bit address for the memory
system that the MEM-AP is connected to. The auto-increment of TAR hap-
pens on every successful read or write access to the data read/write (DRW)
register.
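The interaction between CSW and TAR can be illustrated with a small behavioral model. This is a sketch only: it assumes the ADI encodings Size = b010 for a 32-bit word and AddrInc = b01 for single auto-increment, and MemApModel is a hypothetical stand-in for the hardware, not code from this project.

```cpp
#include <cstdint>

constexpr std::uint32_t CSW_SIZE_WORD      = 0x2u;       // Size[2:0] = b010 (assumed encoding)
constexpr std::uint32_t CSW_ADDRINC_SINGLE = 0x1u << 4;  // AddrInc[5:4] = b01 (assumed encoding)

// Behavioral model of the TAR auto-increment on DRW accesses.
struct MemApModel {
    std::uint32_t csw = 0;
    std::uint32_t tar = 0;

    // Called once for every successful read or write of DRW.
    void onDrwAccess() {
        const std::uint32_t addrInc = (csw >> 4) & 0x3u;
        const std::uint32_t size    = csw & 0x7u;
        if (addrInc == 0x1u) {
            tar += 1u << size;  // advance by the transfer size in bytes
        }
    }
};
```

With word-sized transfers and single auto-increment enabled, each DRW access advances TAR by 4 bytes, which is what makes back-to-back DRW accesses useful for block transfers.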
The data read/write (DRW) register holds a value that depends on the
access mode. In write mode, DRW holds the value to write for the current
transfer to the address specified in the TAR. In read mode, DRW holds the
value that is read in the current transfer from the address that is specified
in TAR.
The identification register (IDR) identifies the AP. As is defined on page
C1-146 of [29], a value of zero indicates that there is no AP present. The
various fields contain revision bits, designer identification code, class of AP,
etc. In this project, it is a fixed non-zero magic number for sanity check use
only, just like the JTAG-DP IDCODE register.
Now that all the necessary details of the ADI and DAP have been covered,
the programmer’s model is discussed in the next section.
3.3.2 Accessing the DAP
This section will cover the base version of the software that only interfaces
with one device. The modification needed for the waferscale system test
access infrastructure that contains multiple devices (cores and dies) will be
covered in section 4.2.2. The main purpose of the software is for the debugger
to initiate memory accesses to the system through a JTAG connection. Only
memory access is needed because all the controls and configurations required
are either memory-mapped registers or memory blocks that belong to the
memory system.
The device in this project that serves as the external test probe, or de-
bugger, is an ARM Mbed LPC1768 Board. This is an ARM CORTEX-M3
based micro-controller, with tens of available I/O pins, running Mbed OS. It
is a coincidence that the cores used in the waferscale system are the same as
in this micro-controller. The software is written in C++, compiled through
Mbed online IDE directly to the LPC1768 board. There should be no modifi-
cations needed other than the I/O pin assignments for the software to run on
other Mbed-OS-based boards, so for generality and simplicity, the LPC1768
board could be referred to as the Mbed board. As far as this chapter is
concerned, i.e., the basic JTAG software that only talks to a single-core device,
the LPC1768 board has a few output pins, including TDI, TMS, TCK, and
one input pin, TDO.
All JTAG-related functionalities, including core debug, program loading,
and memory access, are encompassed in a C++ class simply called JTAG.
In the software, the digital I/O pins on the LPC1768 board can be directly
manipulated. Each variable in this JTAG class will be introduced when it is
first mentioned in the following discussion.
First of all, several functions or methods serve as the basis of JTAG con-
trol, which simply set one of the three input wires in the JTAG interface to
be high or low, and one function to toggle TCK for some cycles. These func-
tions are: DataLow(void), DataHigh(void), TMSLow(void), TMSHigh(void),
TCKLow(void), TCKHigh(void), and TCKTicks(unsigned int c).
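A minimal sketch of these primitives is shown below. To keep it runnable off-target, the board pins are replaced by a hypothetical MockPin type; on the actual board these would be digital output pins. The order of the high and low phases inside TCKTicks is also an assumption.

```cpp
// Hypothetical stand-in for a digital output pin; counts rising edges.
struct MockPin {
    bool level = false;
    unsigned risingEdges = 0;
    void write(bool v) {
        if (v && !level) ++risingEdges;
        level = v;
    }
};

// Sketch of the lowest-level JTAG pin functions.
class JtagPins {
public:
    MockPin tdi, tms, tck;

    void DataLow()  { tdi.write(false); }
    void DataHigh() { tdi.write(true); }
    void TMSLow()   { tms.write(false); }
    void TMSHigh()  { tms.write(true); }
    void TCKLow()   { tck.write(false); }
    void TCKHigh()  { tck.write(true); }

    // Toggle TCK for c full cycles, leaving it low afterwards.
    void TCKTicks(unsigned int c) {
        for (unsigned int i = 0; i < c; ++i) {
            TCKHigh();
            TCKLow();
        }
    }
};
```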
Then there are a few functions for TAP controller state manipulation,
which will drive both TMS and TCK to reach a certain state. These func-
tions are: setState(unsigned char c), leaveState(void), and reset(void). There
is a variable in the class of type char, called state, that is used to determine
what state the on-device TAP controller should be in. In our implementation,
this variable can only take values from “r”, which stands for reset or Test
Logic Reset, “n”, which stands for idle or Run Test Idle, “i”, which stands
for IR or Shift IR, and “d”, which stands for DR or Shift DR. The reason for
only including 4 states out of 16 is that the on-device TAP controller should
only be in these states for more than 1 TCK cycle, and the rest will only
be transitional states. Recall the first thing discussed after introducing the
TAP FSM, namely that if TMS = 1 for at least 5 TCK cycles, the state will
be in Test Logic Reset. This is exactly how the function reset(void) works.
It simply calls function TMSHigh(void), which sets TMS pin to be 1, and
toggles TCK for at least 5 times; in our case, we toggle TCK for 10 times
just to be safe, then set the state variable to be “r”. Then there is leaveS-
tate(void), which transitions the on-device TAP controller to Run Test Idle
state, or “n”. To do this, the leaveState(void) function will access the class
variable state, which can only have the value of “r”, “n”, “i”, or “d”, then
use a switch case to determine TMS and TCK operations. The transitions
from these 4 states to “n” state can be easily found in the discussion of the
TAP FSM in section 3.2. After leaveState(void), setState(unsigned char c)
is ready to be used. The function setState(unsigned char c) assumes the on-
device TAP controller state is in “n” or Run Test idle, and it will transition
to the state given by the input variable c, which again can only take values
from “r”, “n”, “i” and “d” and set the class variable accordingly. From “n”
to “n”, do nothing. From “n” to “r”, simply call function reset(void). From
“n” to “i” and from “n” to “d”, the operations can again be found in the
discussion of the TAP FSM in section 3.2.
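The three state functions can be sketched as sequences of TMS values, one per TCK tick. The TMS sequences below are derived from the standard IEEE Std 1149.1 state diagram and are an assumption about, not a copy of, the project software; the class TapDriver and its tmsLog member are hypothetical and only record what would be driven on the pins.

```cpp
#include <vector>

class TapDriver {
public:
    char state = 'r';         // 'r', 'n', 'i', or 'd'
    std::vector<int> tmsLog;  // TMS value presented on each TCK tick

    // TMS = 1 for 10 ticks (at least 5 are required) reaches Test Logic Reset.
    void reset() {
        clock(1, 10);
        state = 'r';
    }

    // Return to Run Test Idle from any of the four tracked states.
    void leaveState() {
        switch (state) {
            case 'r': clock(0, 1); break;               // Reset -> Idle
            case 'n': break;                            // already idle
            case 'i':
            case 'd': clock(1, 2); clock(0, 1); break;  // Exit1 -> Update -> Idle
        }
        state = 'n';
    }

    // Assumes the controller is currently in Run Test Idle.
    void setState(unsigned char c) {
        switch (c) {
            case 'n': break;
            case 'r': reset(); return;
            case 'd': clock(1, 1); clock(0, 2); break;  // Select DR, Capture DR, Shift DR
            case 'i': clock(1, 2); clock(0, 2); break;  // Select DR, Select IR, Capture IR, Shift IR
            default: return;                            // unknown request: do nothing
        }
        state = static_cast<char>(c);
    }

private:
    void clock(int tms, unsigned ticks) {
        for (unsigned i = 0; i < ticks; ++i) tmsLog.push_back(tms);
    }
};
```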
The next few functions are the essential functions of the whole JTAG class,
and they are shiftBits(unsigned int data, int n), setIR(unsigned char inst),
and shiftData(unsigned int data, char addr, bool rw). They directly control
the operations of shifting bits into the on-device shift registers via TDI like a
PISO (parallel-in-serial-out) and record the output from the on-device shift
registers via TDO to construct the read-out value like a SIPO (serial-in-
parallel-out).
The function shiftBits(unsigned int data, int n) has a return type of unsigned
int and returns the value read out from the on-device registers. The
main body of function shiftBits(unsigned int data, int n) is a loop with n
iterations, where n is the number of bits to be shifted. Each iteration will
record one bit from TDO, put the read-out bit in the correct bit position, set
TDI to be either high or low depending on the corresponding bit of data, and
toggle TCK once. After n iterations, the function returns the n-bit read-out
value. This function will be called whenever bits need to be shifted.
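The loop can be sketched against a software stand-in for the on-device shift register. MockShiftRegister is hypothetical, the extra device parameter replaces the real pin accesses, and bits are assumed to be shifted least significant bit first; TMS handling is left out of this sketch.

```cpp
#include <cstdint>

// Hypothetical model of an on-device shift register of a given width.
struct MockShiftRegister {
    std::uint64_t bits;
    unsigned width;

    bool tdo() const { return (bits & 1u) != 0; }  // bit currently on TDO

    // One TCK cycle: shift right by one, TDI enters at the top.
    void clock(bool tdi) {
        bits = (bits >> 1) | (static_cast<std::uint64_t>(tdi) << (width - 1));
    }
};

// Shift n bits of data in (LSB first) and return the n bits shifted out.
inline unsigned int shiftBits(MockShiftRegister& dev, unsigned int data, int n) {
    unsigned int readout = 0;
    for (int i = 0; i < n; ++i) {
        if (dev.tdo()) readout |= 1u << i;  // record TDO at bit position i
        dev.clock(((data >> i) & 1u) != 0); // drive TDI, then toggle TCK once
    }
    return readout;
}
```

After n iterations the register holds the n bits that were shifted in, and the caller holds the n bits that were previously in the register.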
Recall that in Figure 3.2, the debugger needs to access and configure the
DAP through JTAG-DP to access the memory. The JTAG-DP registers can
be found in Table 3.2. Now that the function shiftBits() is established, the
software uses two wrappers for shifting IR and shifting DR, namely setIR()
and shiftData().
The function setIR(unsigned char inst) simply passes the 4-bit argument
inst to shiftBits(), and sets the other argument of shiftBits() n to be 4. Notice
that the argument inst should only be one of the four listed in Table 3.2,
namely b1010 for DPACC, b1011 for APACC, b1110 for IDCODE and b1111
for BYPASS. This function assumes the on-device TAP controller is already
in state Run Test Idle or “n”. After entering setIR(), the function first
transitions the on-device TAP controller to Shift IR by calling setState(‘i’).
Then the input argument of the function is shifted in through calling function
shiftBits(inst, 4). Finally, leaveState() is called to transition the on-device
TAP controller to be in Run Test Idle or “n” for future accesses.
The function shiftData(unsigned int data, char addr, bool rw) is designed
to access DPACC and APACC due to the operations needed. Recall that in
section 3.3.1, these two registers are 35-bit wide. Before shifting, there are 32
bits of data, and 3 bits of status. The 3-bit status can be OK/FAULT, WAIT,
or undefined. In our software implementation, for simplicity, if OK/FAULT
is seen, then the software moves on instead of checking whether the previous
transaction was completed successfully or faulted, since doing so would require
reading another register, CTRL/STAT in the DP, for detailed status, which can
make the software run more than twice as long. To compensate for this shortcut,
other high-level checks were implemented. WAIT indicates that the previous
transaction has not completed. Therefore, if WAIT is seen, the software will
retry the DP or AP access until OK/FAULT is seen.
It is also assumed that before entering the function shiftData(), the on-
device TAP controller is in Run Test Idle or “n” state, which is guaranteed
since leaveState() will always be called before entering this function, or more
precisely, at the end of all functions that shift bits.
After entering the function shiftData(), there is a while loop that checks a
local Boolean variable called gotwait, which is initialized to true before the
loop. At the beginning of the loop, gotwait is set to false and setState(‘d’)
is called to transition the TAP controller to Shift DR or “d” state. Then
the 1-bit R/W and 2-bit address from the function arguments are shifted in
by calling shiftBits(). The 3-bit ACK returned by shiftBits() is checked. If
WAIT is seen, then the local variable gotwait will be set to true, the TAP
controller will be transitioned back to Run Test Idle or “n” state through
the leaveState() function, and the while loop will continue to the next
iteration. If OK/FAULT is seen, the while loop exits. If any other value is
seen, which is not implemented, a print-out will be generated, leaveState()
will be called, and the program will terminate. After the while loop, i.e.,
once OK/FAULT is seen, shiftBits(data, 32) is called and its returned value is
saved. Finally, leaveState() is called to set the on-device TAP controller to
Run Test Idle or “n” state for future accesses, and the saved value is returned
by shiftData().
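The retry logic can be sketched as follows. The MockDap type is a hypothetical stand-in for the device that answers WAIT a configurable number of times; the TAP state calls are shown as comments, and an exception replaces the print-out and program termination of the real software.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr std::uint8_t ACK_OK_FAULT = 0x2;  // b010
constexpr std::uint8_t ACK_WAIT     = 0x1;  // b001

// Hypothetical device model: answers WAIT waitsLeft times, then OK/FAULT.
struct MockDap {
    int waitsLeft = 0;
    std::uint32_t readout = 0;
    std::uint8_t lastHeader = 0;
    std::uint32_t lastData = 0;
    int scans = 0;

    std::uint8_t shiftHeader(std::uint8_t header) {  // 3 bits in, ACK out
        ++scans;
        lastHeader = header;
        if (waitsLeft > 0) { --waitsLeft; return ACK_WAIT; }
        return ACK_OK_FAULT;
    }
    std::uint32_t shiftData32(std::uint32_t data) {  // 32 bits in, read-out out
        lastData = data;
        return readout;
    }
};

inline std::uint32_t shiftData(MockDap& dap, std::uint32_t data, std::uint8_t addr, bool rw) {
    bool gotwait = true;
    while (gotwait) {
        gotwait = false;
        // setState('d') would be called here.
        const std::uint8_t header =
            static_cast<std::uint8_t>(((addr & 0x3u) << 1) | (rw ? 1u : 0u));
        const std::uint8_t ack = dap.shiftHeader(header);
        if (ack == ACK_WAIT) {
            gotwait = true;  // leaveState() would be called, then retry
        } else if (ack != ACK_OK_FAULT) {
            throw std::runtime_error("unexpected ACK value");
        }
    }
    const std::uint32_t result = dap.shiftData32(data);
    // leaveState() would be called here.
    return result;
}
```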
Using setIR() and shiftData(), another level of abstraction is created to
make memory access easy to program. Before talking about this set of func-
tions, macros were created for all the common bit patterns, including IR
values for JTAG-DP and 2-bit addresses for DP and MEM-AP. The values
can be found in Table 3.2, Table 3.3 and Table 3.4 respectively.
First, there is the function writeBanksel(unsigned int banksel) that calls
setIR(JTAG_DPACC) and shiftData(banksel << 4, DP_SELECT, WRITE).
Note that JTAG_DPACC is a macro for b1010, DP_SELECT is a macro for
b10, and WRITE is a macro for b0. The reason behind banksel << 4 is that
the field definition for APBANKSEL is at bits [7:4]. This function readies the
subsequent accesses of MEM-AP registers CSW, TAR, and DRW.
Then there are four functions that perform either a read or write operation
to either DPACC or APACC, namely readDPACC(unsigned char addr),
writeDPACC(unsigned int data, unsigned char addr), readAPACC(unsigned char
addr), and writeAPACC(unsigned int data, unsigned char addr). These
functions are very similar, with the only difference being the macros used. These
functions first call setIR(JTAG_DPACC) or setIR(JTAG_APACC), then call
shiftData(0, addr, READ) or shiftData(data, addr, WRITE). Note that the
two read functions will return the 32-bit data shifted out by shiftData(0,
addr, READ).
To initiate a memory access, two higher-level functions were created. They
are simply called readMemory(unsigned int addr) and writeMemory(unsigned
int addr, unsigned int value). These two functions call a sequence of the
aforementioned JTAG-DP access functions: writeBanksel(0), writeAPACC(WORD,
AP_CSW), writeAPACC(addr, AP_TAR), then readAPACC(AP_DRW) or
writeAPACC(value, AP_DRW). Note that WORD is a macro used to set the proper
bits for the AP_CSW register to have a transaction of a 32-bit word.
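The call sequence can be sketched on top of a mock MEM-AP. MockJtag is a hypothetical class whose access functions operate on an in-memory map instead of shifting bits; the value of WORD is an assumption, and the mock returns read data immediately, ignoring any delay between a read request and the availability of its result.

```cpp
#include <map>

constexpr unsigned char AP_CSW = 0x0;
constexpr unsigned char AP_TAR = 0x1;
constexpr unsigned char AP_DRW = 0x3;
constexpr unsigned int  WORD   = 0x2;  // assumed: CSW Size field = 32-bit word

class MockJtag {
public:
    std::map<unsigned int, unsigned int> mem;  // stand-in for the memory system

    void writeBanksel(unsigned int banksel) { bank = banksel; }

    void writeAPACC(unsigned int data, unsigned char addr) {
        if (bank != 0) return;                  // only bank 0 is modeled
        if (addr == AP_CSW) csw = data;
        else if (addr == AP_TAR) tar = data;
        else if (addr == AP_DRW) mem[tar] = data;
    }

    unsigned int readAPACC(unsigned char addr) {
        if (bank != 0 || addr != AP_DRW) return 0;
        std::map<unsigned int, unsigned int>::const_iterator it = mem.find(tar);
        return it == mem.end() ? 0 : it->second;
    }

    unsigned int readMemory(unsigned int addr) {
        writeBanksel(0);
        writeAPACC(WORD, AP_CSW);
        writeAPACC(addr, AP_TAR);
        return readAPACC(AP_DRW);
    }

    void writeMemory(unsigned int addr, unsigned int value) {
        writeBanksel(0);
        writeAPACC(WORD, AP_CSW);
        writeAPACC(addr, AP_TAR);
        writeAPACC(value, AP_DRW);
    }

private:
    unsigned int bank = 0, csw = 0, tar = 0;
};
```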
Modified versions of readMemory and writeMemory were also written
to facilitate common use cases. For example, the wipeMemory function
uses a loop of writeAPACC(0, AP_DRW) calls instead of the single one in
writeMemory to wipe, i.e., write 0 to, a certain memory region.
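A wipe loop of this kind relies on the TAR auto-increment, so that the address is written only once. The sketch below assumes a CSW value that selects word transfers with single auto-increment (the hypothetical constant WORD_AUTOINC); the signature of wipeMemory and the MockMemAp type are likewise assumptions made for illustration.

```cpp
#include <map>

constexpr unsigned char AP_CSW = 0x0;
constexpr unsigned char AP_TAR = 0x1;
constexpr unsigned char AP_DRW = 0x3;
constexpr unsigned int WORD_AUTOINC = 0x12;  // assumed: AddrInc = b01, Size = b010

// Hypothetical MEM-AP model with TAR auto-increment on DRW writes.
class MockMemAp {
public:
    std::map<unsigned int, unsigned int> mem;

    void writeAPACC(unsigned int data, unsigned char addr) {
        if (addr == AP_CSW) csw = data;
        else if (addr == AP_TAR) tar = data;
        else if (addr == AP_DRW) {
            mem[tar] = data;
            if (((csw >> 4) & 0x3u) == 0x1u) tar += 4;  // word auto-increment
        }
    }

    // Write 0 to `words` consecutive 32-bit words starting at addr.
    void wipeMemory(unsigned int addr, unsigned int words) {
        writeAPACC(WORD_AUTOINC, AP_CSW);
        writeAPACC(addr, AP_TAR);  // the address is set only once
        for (unsigned int i = 0; i < words; ++i) {
            writeAPACC(0, AP_DRW);
        }
    }

private:
    unsigned int csw = 0, tar = 0;
};
```

Compared with calling writeMemory once per word, the loop saves the CSW and TAR accesses for every word after the first.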
CHAPTER 4
TESTING INFRASTRUCTURE FOR A
WAFERSCALE SYSTEM
Expanding upon IC testing for a single-core ARM-based system, this chapter
details a test infrastructure for an ARM-based waferscale system. Section 4.1
gives an overview of the architecture of the said waferscale system and some
physical constraints since the goal of this project is a taped-out functional
prototype. Section 4.2 discusses the test infrastructure of the waferscale
system, including the hardware organization and software needed to access
the test infrastructure.
4.1 Overview of a Waferscale System
4.1.1 Waferscale System Architecture
The target waferscale system consists of hundreds of tiles, connected on a
single wafer through Si-IF. Each tile consists of several individual silicon
dies. In this tape-out, a compute die and a separate memory die, connected
through Si-IF, form a tile. On the compute die, there are 14 ARM CORTEX-
M3 cores, an ARM AMBA 3 AHB bus matrix, a set of memory-mapped
registers, network-on-chip modules, and clocking-related modules. On the
memory die, there are 5 banks of shared memory, and each bank is a 128KB
SRAM block.
ARM CORTEX-M3 is based on the 32-bit ARMv7-M architecture [31]. It
has a three-stage pipeline and supports the Thumb and Thumb-2 instruction
sets. The barebones processor consists of a processor core and an internal
bus matrix. The processor core connects to the master side of the internal
bus matrix, and there are three ports on the slave side: i-code, d-code, and
system. The i-code and d-code ports are used for instruction and data for
the core, respectively. The system port is used for any other peripherals.
In our design, the i-code and d-code ports are connected to a single private
64KB SRAM through an arbiter-like component called “code mux”, which
lets the core use the whole 64KB memory space as both instruction and data
memory. The reason we call this a private SRAM is that it can only be used
by or accessed through the core that it is connected to, compared to the 5
banks of shared SRAM that can be accessed by more entities, which will be
covered later. The internal bus matrix is configured to connect the master
side port to a slave side port through memory addresses. In our case, any
address above and including 0x20000000 will be routed to the system port,
whose connectivity will be detailed in the next few paragraphs. The rest
of the memory space, i.e., any address below 0x20000000, will go through
to either i-code or d-code port, but eventually arrives at the private 64KB
SRAM.
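The routing rule of the internal bus matrix reduces to a single address comparison. The following sketch uses hypothetical names (CorePort, routeInternal) and does not distinguish between the i-code and d-code ports, since both end at the same private SRAM.

```cpp
#include <cstdint>

enum class CorePort { Code, System };

constexpr std::uint32_t SYSTEM_BASE = 0x20000000u;

// Addresses at or above 0x20000000 go to the system port; everything
// below goes to the i-code/d-code ports and thus to the private SRAM.
constexpr CorePort routeInternal(std::uint32_t addr) {
    return addr >= SYSTEM_BASE ? CorePort::System : CorePort::Code;
}
```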
The network-on-chip (NoC) of the waferscale system uses a mesh topology,
with each tile as a node in the mesh. The router design is based on the
OpenCelerity project [32]. Our NoC has two meshes, one with x-y routing
and the other with y-x routing. That means each tile has two routers, one
for the x-y mesh and the other for the y-x mesh.
Architecturally, all components in tiles communicate through the AHB
bus matrix. Our bus matrix is a 16x7 matrix, meaning it has 16 ports
on the master side and 7 ports on the slave side. Out of the 16 ports on
the master side, 14 are connected to the system ports of the 14 cores and
the other 2 are connected to the 2 depacketizers. Out of the 7 ports on
the slave side, 5 are connected to the 5 banks of shared SRAM memory,
one port is connected to the memory-mapped registers, and the last one is
connected to a packetizer. The two depacketizers and one packetizer are used
to receive and send packets through the NoC, respectively. We call this 16x7
bus matrix, which connects all components in a tile, the shared bus matrix.
It is configured to be sparsely connected. The slave side ports are divided by
memory address space, detailed in Table 4.1. The block diagram for a tile is
shown in Figure 4.1.
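The address ranges of Table 4.1 can be written as a decode function. This is an illustrative sketch only; the enumeration Target and the function decodeShared are hypothetical, and addresses outside the ranges listed in Table 4.1 are reported as NotShared.

```cpp
#include <cstdint>

enum class Target {
    NotShared,  // not covered by Table 4.1
    SramBank0, SramBank1, SramBank2, SramBank3, SramBank4,
    Unused, ConfigRegs, Packetizer
};

// Map an address on the shared bus matrix to its slave-side target.
inline Target decodeShared(std::uint32_t addr) {
    if (addr < 0x20000000u) return Target::NotShared;
    if (addr <= 0x20009FFFu) {
        // Five consecutive banks, each spanning 0x2000 addresses.
        const std::uint32_t bank = (addr - 0x20000000u) / 0x2000u;
        return static_cast<Target>(static_cast<std::uint32_t>(Target::SramBank0) + bank);
    }
    if (addr <= 0x3FFFFFFFu) return Target::Unused;
    if (addr <= 0x5FFFFFFFu) return Target::ConfigRegs;
    if (addr <= 0xDFFFFFFFu) return Target::Packetizer;
    return Target::NotShared;
}
```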
Table 4.1: Memory map of a tile
Memory Space Description
0x20000000-0x20001FFF Shared SRAM bank 0
0x20002000-0x20003FFF Shared SRAM bank 1
0x20004000-0x20005FFF Shared SRAM bank 2
0x20006000-0x20007FFF Shared SRAM bank 3
0x20008000-0x20009FFF Shared SRAM bank 4 (local core access only, “bookkeeping” space)
0x2000A000-0x3FFFFFFF Unused
0x40000000-0x5FFFFFFF Configuration register (only 0x40000000-0x400000FF are used)
0x60000000-0xDFFFFFFF Packetizer
Figure 4.1: Block diagram of a tile
The processor cores used in the waferscale system, namely ARM CORTEX-M3,
support a few debugging features. Some notable ones are processor halt,
processor core register access and full system memory access. Among the
debug registers, the most important one that will be heavily used in our
system is the DHCSR (Debug Halting Control and Status Register), which
enables processor halt. To access these features external to the core, the
optional Debug Access Port (DAP) can be used, which is already covered in
detail in section 3.3.1. The DAP is a master into the private/internal bus
matrix of the core. By accessing the DAP through the methods described in
section 3.3.2, AHB transactions will be generated to the private/internal bus
matrix of the core. The DAP enables access to the core’s debug registers,
including DHCSR, and full system memory space, including the core’s private
SRAM, and all the shared memory space through the system port of the
internal/private bus matrix of the core, which connects directly to the shared
bus matrix of the tile, including shared SRAMs, configuration registers of




To tape out the waferscale system, the following details and their
implications for our design must be considered. We provide two GDSII files to
a foundry, one for the compute die and one for the memory die. The foundry
makes hundreds of each die and sends them to a Si-IF lab, where all the dies
are connected on a wafer. That implies all the tiles must share the same design.
On the wafer, only neighboring tiles are connected via short Si-IF connections,
and broadcasting any kind of signal across multiple dies is impractical with
Si-IF. Therefore, on-die bypassing channels are designed to route signals from
one side of the die to another with double inverter buffers.
Since the waferscale system is designed as a general-purpose computing
system, all the memories are based on SRAMs instead of ROMs, meaning
we must program all the required SRAMs for a given application. Before
bonding each die to the wafer, testing must be done as post-silicon verification.
The I/O pads for pre-bonding tests are large in comparison to Si-IF
I/O pads because they will be wire-bonded to a testing harness instead of
Si-IFs and they are sacrificial.
To demonstrate Si-IF technology, we designed the bandwidth between tiles
to be as wide as possible. Consider that the link between two routers on
neighboring tiles is designed to be 200 bits, 100 bits as input, and the other
100 bits as output. On each side of the tile, there are 2 links for the 2 meshes,
which comes to 400 I/O pins for NoC only.
Due to the sheer size of the waferscale system, long wire lengths introduce
many challenges. For debugging purposes, it is nearly impossible to have
a few signals from every tile connected to the edge of the wafer. For clock
distribution, it is impossible to only use a single clock source to synchronize
the whole wafer.
To provide a solution for debugging and programming needs for the wafer-
scale system with as few I/Os as possible, a debugging infrastructure based
on JTAG protocol is developed. To provide a solution for clocking the wafer-
scale system, a clock forwarding scheme is developed.
4.2 Testing Infrastructure for the Waferscale System
This section details the test access infrastructure, which is based on section
3.2, in a chiplet-based waferscale system described in section 4.1.1. The
hardware subsection covers the organization of the DAPs and additional
components and signals. The software subsection covers modifications and
additions to the basic software described in section 3.3.2 to support the test
access need for the waferscale system. It is desirable to achieve various lev-
els of test access granularity. The specific requirements are the following:
memory access to a single core in a certain die and its private registers and
memory region in the waferscale system, memory access to memory-mapped
components in a certain die, memory access to a certain set of cores in the
same or different dies and their private registers and memory region, memory
access to all the cores in a certain die or multiple dies, etc. In short, the test
access architecture should support an arbitrary core, an arbitrary number
of cores, and all cores, while having a minimal slowdown in case of mass
memory accesses like memory wiping and program loading.
4.2.1 Hardware Organization
It is beneficial to have debugging support for all 14 cores within a tile while
minimizing the number of I/Os dedicated to debugging. Ideally, only the
necessary four wires of the JTAG interface should be dedicated as I/Os for a tile.
Notice that the DAP is internal to the core, with a JTAG interface. There
are 14 cores inside a tile, meaning 14 sets of JTAG interfaces are needed
for full 14-core debugging support, while the tile has only one four-wire
interface as its I/O. To solve this problem, daisy-chained JTAG, which is
within the IEEE Standard 1149.1, is implemented. Daisy-chained JTAG
is typically used to chain up multiple devices, usually multiple packages or
dies on a PCB, so that they can be accessed by a test probe with only one
four-wire JTAG interface.
The connections of the 14 DAPs follow the daisy-chained JTAG configu-
ration. To make the discussion of the wires easier, the I/Os of the tile will
be called a tile signal, for example, tile TDI, and the signals for core X will
be called signal X, for example, TDI X. The tile TCK and tile TMS from
the I/Os of the tile are connected directly to the TCKs and TMSs of all the
cores. The tile TDI is connected directly to core 0’s TDI, then core 0’s TDO
is connected directly to core 1’s TDI, core 1’s TDO is connected to core 2’s
TDI, and so on. Finally, the TDO of core 13 (the last core) is connected to
the tile TDO. This configuration forms a 14-device chain, with only the
four-wire tile I/Os as the interface. Since the TCK and TMS are shared by all
DAPs, the TAP controller states in all the DAPs will be identical. Whenever they
are in Shift IR or Shift DR states, the IRs or the selected DRs from all the DAPs
effectively form a shift register that is 14 times the width of an individual
IR or specific DR.
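The effect of the chain on the instruction registers can be modeled in software. In this sketch each core's 4-bit IR is one byte in a vector, bits are shifted least significant bit first, and the names clockChain and loadChain are hypothetical. It shows that one scan must carry 14 x 4 = 56 bits, and that the instruction for the last core in the chain has to be shifted in first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned CORES          = 14;
constexpr unsigned IR_BITS        = 4;
constexpr unsigned ACC_BITS       = 35;
constexpr unsigned CHAIN_IR_BITS  = CORES * IR_BITS;   // 56 bits per IR scan
constexpr unsigned CHAIN_ACC_BITS = CORES * ACC_BITS;  // 490 bits per DPACC/APACC scan

// One TCK cycle in Shift IR: every IR shifts right by one bit, and core k
// receives on its TDI the bit that core k-1 presented on its TDO.
inline bool clockChain(std::vector<std::uint8_t>& ir, bool tileTdi) {
    bool in = tileTdi;
    for (std::uint8_t& r : ir) {
        const bool out = (r & 1u) != 0;
        r = static_cast<std::uint8_t>((r >> 1) | (in ? (1u << (IR_BITS - 1)) : 0u));
        in = out;
    }
    return in;  // bit seen on tile TDO during this cycle
}

// Load inst[k] into core k by shifting the whole chain once.
inline void loadChain(std::vector<std::uint8_t>& ir, const std::vector<std::uint8_t>& inst) {
    for (std::size_t k = inst.size(); k-- > 0;) {  // last core's bits go first
        for (unsigned b = 0; b < IR_BITS; ++b) {
            clockChain(ir, ((inst[k] >> b) & 1u) != 0);
        }
    }
}
```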
Recall that the DAP enables core debugging, including processor halt, and
access to the full memory space. One of the most important test access features
is to program the private i-code/d-code SRAM of each core and the shared SRAM
if needed. In the case of a SIMD programming paradigm, which means all the
cores will run the same binary, an optimization of the daisy-chained JTAG
organization in our system can be made. When the goal is to have identical
operations to all 14 cores through the DAPs, it would be wasteful to shift 14
times the number of bits of a program. The optimization is to broadcast
TDI instead of chaining TDIs and TDOs. To be more specific, the tile TDI
is connected directly to the TDIs of all the DAPs of the cores, while the tile
TDO is only connected to one TDO out of the 14 TDOs from the DAPs of
the cores. In this configuration, the external test probe effectively only sees
one device, or a chain length of 1. When doing identical operations to all
the cores and the read-out does not matter, this configuration can effectively
shorten the load time by 14 times. One downside of this broadcast scheme is
that there is no simple way to get all the TDOs out to the external testing
probe without increasing the number of I/Os significantly. That is why this
configuration should only be used when the read-out of the register through
TDO does not matter.
To combine both the daisy-chained configuration and broadcast configu-
ration, multiplexers are inserted before all the TDIs of all the cores except
core 0, since core 0’s TDI will always be connected to tile TDI in both con-
figurations. These multiplexers are controlled by a new tile I/O signal, called
broadcast, which selects between the TDO of the previous core and tile TDI.
There is another multiplexer to select between the last core’s TDO and core
0’s TDO to the tile TDO, which is controlled by the same tile I/O signal as
the TDI multiplexers. This seems to be useless since, in both configurations,
the tile TDO can be connected directly to the last core’s TDO. However, this
multiplexer does enable the external test probe to receive two possible TDO
signals in case faults occur in the taped-out die.
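The multiplexers can be summarized as combinational functions. This sketch assumes that a high value on broadcast selects the tile TDI for every core and core 0's TDO for the tile TDO; the polarity and the names coreTdi, tileTdo, and bitsToShift are assumptions made for illustration.

```cpp
#include <array>

constexpr int CORES = 14;
using TdoBus = std::array<bool, CORES>;  // TDO of each core's DAP

// TDI seen by a core: core 0 always gets tile TDI; the others get either
// tile TDI (broadcast) or the previous core's TDO (daisy chain).
inline bool coreTdi(int core, bool broadcast, bool tileTdi, const TdoBus& tdo) {
    if (core == 0 || broadcast) return tileTdi;
    return tdo[core - 1];
}

// Tile TDO: core 0's TDO in broadcast mode, the last core's TDO otherwise.
inline bool tileTdo(bool broadcast, const TdoBus& tdo) {
    return broadcast ? tdo[0] : tdo[CORES - 1];
}

// Bits that must be shifted so that every core receives bitsPerCore bits.
inline unsigned long bitsToShift(unsigned long bitsPerCore, bool broadcast) {
    return broadcast ? bitsPerCore : bitsPerCore * CORES;
}
```

For a 35-bit DPACC or APACC scan, the chained configuration shifts 490 bits while the broadcast configuration shifts 35, which is the 14-fold reduction in load time.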
The daisy-chained configuration in the tile is also used to chain up multiple
tiles on the wafer. Specifically, since TCK and TMS from an external test
probe need to connect to all the TCK and TMS, on-die bypassing channels
are used. Then the TDI from the external test probe goes to the tile 0’s
TDI, tile 0’s TDO goes to tile 1’s TDI, tile 1’s TDO goes to tile 2’s TDI,
and so on. However, because all dies share the same design, the TDO of
the last die in the chain is on the other end of the wafer. To have that
TDO connected back to the external test probe, on-die bypassing channels
are considered again, though the channels are in the opposite direction of the
TCK and TMS channels. Simply using the on-die bypassing channels, even
with double inverter buffers, will cause a significant wire delay for the TDO
from the last tile to reach the external test probe.
IEEE Standard P1838 tries to solve this problem and some others. This
paragraph discusses the control mechanism proposed in IEEE Std P1838,
shown in Figure 4.2, but without the secondary TAP S2 or TRSTN signals
since they are not used in our design. The primary TAP is the four-wire
JTAG interface that connects to the external test probe if it is the first
device in the chain or simply connects to the previous device. The secondary
TAP S1 has a similar four-wire JTAG interface with a few caveats. TCK S1
and TMS S1 are directly from the primary TAP’s TCK and TMS since we
forgo the TCR signal along with multiplexer m4. TDO S1 is the output of
this device’s shift registers and will go to the primary TAP’s TDI of the next
device. TDI S1 is the input from the primary TAP’s TDO from the next
device. Multiplexer m5 selects between this device’s shift register output
and TDI S1 coming from the next device. The output signal of multiplexer
m5 serves as this device’s TDO for the primary TAP.
Figure 4.2: IEEE Std P1838 serial control mechanism
Following IEEE Standard P1838 [33], registers that are triggered by the
rising edge of TCK are inserted on every tile on the TDO path that brings the
TDO of the last tile to the external test probe. Figure 4.2 adopts rules
similar to IEEE Std 1149.1: incoming TDI data shall be acquired on the rising
edge of TCK, while TDO outputs shall change on the falling edge of TCK.
Accordingly, falling-TCK-edge-triggered registers are used whenever TDO
leaves a device. These registers are already implemented in
the DAPs of the cores, which means a falling-TCK-edge-triggered register
needs to be added so that the timing of the signals can conform to IEEE
Standard P1838. We call this TDO loopback. Multiplexer m5 in Figure 4.2
is added to select between the TDO of the current tile and the TDO output
from the next tile, to connect to the tile’s I/O for TDO. This fully enables
TDO loopback from the faraway tiles to reach the first tile that the external
test probe is connected to. The control of multiplexer m5 is done through a
bit in a configuration register, which will be detailed in section 5.1 and section
5.3. Briefly, since the control signal for multiplexer m5 is from a register, after
reset, the default value will select this tile’s shift register output, so that the
external test probe can only see one device, and after configuration, the chain
length can be increased to two devices, three devices, and so on.
This paragraph will discuss a signal called chain select, which is used to
control the JTAG accessibility of a tile. The chain select signal is an input
I/O of a tile, which will be driven by either the external test probe or an
output I/O from the previous tile. This signal simply gates the TDI and
TCK of the tile. If chain select is low, both TDI and TCK of the tile will be
a constant low, which prevents the DAPs of the cores in the tile from being
affected by the tile’s JTAG signals. The chain select I/O of the first tile is
directly controlled by the external test probe, while the chain select I/O of
each remaining tile is controlled by an output I/O of the previous tile, which
will be covered later during the discussion of configuration registers and the
boot-up sequence.
On a finished waferscale system, there will be multiple LPC1768 boards,
each connected to multiple tiles on the edge of the wafer, allowing each board
to talk to multiple tile chains. Therefore, each LPC1768 board will have one
set of outputs, including TDI, TMS, TCK, multiple chain select signals, and
broadcast signals, and will take multiple TDO signals, one from each tile
chain, as inputs.
4.2.2 Accessing the Test Infrastructure of a Waferscale
System
The JTAG class in the basic software is designed to handle a single core. In
the waferscale system, it is extended to handle a single JTAG chain that can
have multiple tiles. Since a single Mbed board can be connected to multiple
chains, a class variable, called curr chain, is introduced to keep track of which
chain the current JTAG class object is talking to. Accordingly, shiftBits()
was modified to check curr chain and select which TDO to use among the
multiple TDO inputs. Because the chain select signal gates the TCK and
TDI of a chain, the chain select signals are set up before a JTAG class
object is created to minimize possible glitches.
After a chain is selected, which is done simply by setting the chain select
signals and the curr chain variable when creating a new JTAG object, multiple class
variables are used to select a core or multiple cores within a chain. These
variables are num tiles, pre shift, num cores and core op[].
The variables num tiles and pre shift are used to compensate for the
negative-edge triggered registers per tile introduced by IEEE Std P1838 [33]
to mitigate the long wire problem with TDO loopback in a long chain. The
variable num tiles is set when creating a new JTAG object, and pre shift is a
flag variable. The idea is that, due to the additional negative-edge triggered
registers, additional dummy bits need to be shifted for the meaningful bits to
be shifted into the correct positions. The software may call shiftBits()
multiple times in a single Shift IR or Shift DR state, and ends the operation
with a call to leaveState(). Therefore, the software modification is the
following: initialize pre shift to 0 when a new JTAG object is created; at the
beginning of shiftBits(), check pre shift, and if it is 0, shift n-1 dummy
bits, with n being num tiles, and set pre shift to 1, otherwise do nothing; in
leaveState(), set pre shift back to 0. These are the only changes needed to
shift in the correct number of dummy bits to occupy the negative-edge
triggered registers while having the read-out and shift-in bits be in the
correct positions.
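The bookkeeping around pre shift can be illustrated with a small Python model. This is only a sketch and not the Mbed software itself: the class name, the snake_case method names, and the choice of 0 as the dummy bit value are assumptions made for illustration, and the model records the bits that would be clocked out on TDI instead of driving any pins.

```python
class JtagShiftModel:
    """Toy model of the pre_shift logic; records bits instead of driving pins."""

    def __init__(self, num_tiles):
        self.num_tiles = num_tiles  # number of tiles in the selected chain
        self.pre_shift = 0          # 0: dummy bits not yet sent in this shift state
        self.sent = []              # every bit clocked out on TDI, in order

    def shift_bits(self, bits):
        if self.pre_shift == 0:
            # n-1 dummy bits, sent only once per Shift IR / Shift DR state.
            # The value 0 for a dummy bit is an assumption of this sketch.
            self.sent.extend([0] * (self.num_tiles - 1))
            self.pre_shift = 1
        self.sent.extend(bits)

    def leave_state(self):
        # Leaving the shift state re-arms the dummy-bit insertion.
        self.pre_shift = 0


jtag = JtagShiftModel(num_tiles=3)
jtag.shift_bits([1, 0, 1])   # first call in this state: 2 dummy bits, then data
jtag.shift_bits([1, 1])      # same state: no dummy bits
jtag.leave_state()
jtag.shift_bits([0, 1])      # new state: 2 dummy bits again
```

With three tiles, two dummy bits precede the first shiftBits() call of every shift state, and no dummy bits are added to later calls in the same state.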
The variables num cores and core op[], along with the control of the broad-
cast signal, select a set of cores for the software to talk to. The variable
num cores shows the number of devices in the current chain. Recall that
when the broadcast signal is high, a tile will be seen as a single device due
to the broadcasting of TDI and only having one returning TDO. Therefore,
if the broadcast signal is high, num cores should be the number of tiles in
the chain, otherwise, num cores should be the total number of cores in the
chain. The variable core op[] is a bit vector indicating which core(s) need
to be operated on. The size of core op[] is the same as num cores, meaning
it is used to select which tile(s) to operate on in broadcast mode and which
core(s) to operate on in non-broadcast mode.
A few functions are modified using core op[] to support addressing an
arbitrary number of cores while not affecting other cores in the same chain.
These functions are setIR() and shiftData(). Since all the registers based
on JTAG are shift registers, in a chain of multiple devices, bits for the last
devices need to be shifted first. The function setIR() first calls setState(‘i’),
which sets all the TAP FSMs in the chain to the Shift IR state. The shifting
part of setIR() is now inside a loop with n iterations, with n being
num cores. Inside the loop, core op[] of the current device is checked. If the
flag is true, the intended instruction bits are shifted in, otherwise, BYPASS
instruction is shifted in. Note that the first iteration of the loop corresponds
to the last, or the furthest, device in the chain. BYPASS instruction is used
to minimize the chain length while ensuring the core(s) that are not being
addressed will not get affected by other core(s). Recall if IR is BYPASS, the
corresponding DR is only 1-bit wide, while other values for IR (IDCODE,
DPACC, and APACC) result in a 32-bit or 35-bit wide DR.
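The per-device selection in setIR() can be sketched in Python as follows. The sketch assumes 4-bit instruction registers with the ARM JTAG-DP encodings BYPASS = 0b1111 and DPACC = 0b1010, LSB-first shifting, and that index 0 of core_op is the device closest to the test probe; the function names are hypothetical and only build the bit stream instead of toggling pins.

```python
IR_LEN = 4        # assumed instruction register width per device
BYPASS = 0b1111   # assumed BYPASS encoding
DPACC = 0b1010    # assumed DPACC encoding


def ir_bit_stream(core_op, instruction):
    """Return the TDI bits for one Shift IR pass, in the order they are shifted."""
    bits = []
    # Bits for the furthest device are shifted first, so walk the chain backwards.
    for selected in reversed(core_op):
        value = instruction if selected else BYPASS
        bits.extend((value >> i) & 1 for i in range(IR_LEN))  # LSB first
    return bits


def dr_chain_length(core_op, selected_dr_len=35):
    """Total DR length: 1 bit per bypassed device, selected_dr_len per selected one."""
    return sum(selected_dr_len if selected else 1 for selected in core_op)


# Three devices in the chain, only the middle one is addressed.
stream = ir_bit_stream([False, True, False], DPACC)
length = dr_chain_length([False, True, False])
```

In this example the two bypassed devices contribute a single bit each to the following DR scan, so the 35-bit access of the selected device travels through a 37-bit chain.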
The function shiftData() is modified in a very similar way. First,
setState(‘d’) is called to set all the TAP FSMs in the chain to the Shift DR
state. Then the shifting part is encompassed in an n-iteration loop, with n
being num cores. The first iteration of the loop corresponds to the last device
in the chain. Then core op[] of the current device is checked; if true, then the
code follows the basic 35-bit operation, otherwise, the software simply shifts




CHALLENGES IN A WAFERSCALE
SYSTEM
The sheer scale of a waferscale system and the fact that there is only one
set of external connections among a large number of tiles present a set of
unique challenges. Some notable ones are the following: power delivery to all
the chiplets across the wafer, clock distribution over such a large area
(around 15,000 mm²), and initialization and configuration of various
components in the system. Section 5.1 discusses how to configure various
components and access several hardware counters through a set of
memory-mapped registers. Section 5.2 discusses the problem of clock
distribution in such a large system, and proposes structures and strategies
to solve the problem. Considering proper resets, power delivery, clock
distribution, and program loading and execution, section 5.3 proposes a
boot-up sequence for the waferscale system controlled solely by the external
debuggers connected to the edge of the wafer. The details of power delivery
and some other challenges like I/O design and NoC design are outside the
scope of this thesis.
5.1 Accessing Uncore Components
There are many components on the tile other than the cores, called uncore
components, that need to be configured post-assembly, during boot-up, or even
during or after program execution, for example, clock distribution circuits,
NoC routers, power management components, and DfT circuits. Some of the
uncore components have their own hardware counters. To facilitate configuring
the uncore components and to provide easy access to the hardware counters
through an external debugger, two sets of memory-mapped registers were added
to the tile design. For simplicity, they are called configuration registers.
Recall from sections 3.3.2 and 4.2 that the only external connection of the
system is through the test infrastructure, and the test infrastructure is
designed for memory accesses. Memory-mapped registers allow easy
configuration and access to the hardware counters through a simple write or
read to a memory address, meaning both the external debugger and the on-tile
cores can access these configuration registers.
There are 2 sets of 32 32-bit registers. One set is readable and writable
through the tile shared bus matrix. They will be referred to as r/w registers.
The other set is only readable, i.e., not writable through the tile shared bus
matrix. They will be referred to as read-only registers. The outputs of the
r/w registers are usually connected to some components of the tile and the
values are used to configure those modules. All the registers will be reset
to 0s by the tile-wide power-on reset. Therefore, the default configuration,
i.e., all 0s from the r/w registers, is designed to be valid for the
corresponding modules. The full list of the r/w registers that are currently
in use is shown
in Table 5.1. The inputs of the read-only registers mainly come from some
performance counters of the tile components and some status signals. The
full list of the read-only registers that are currently in use is shown in Table
5.1.
To access these registers through a single port on the tile shared bus matrix
and properly do address checking, a wrapper of the registers is developed.
As mentioned in the shared memory address mapping, the address space for
the registers starts at 0x40000000. Each register occupies 4 consecutive ad-
dresses, following the overall byte-addressable convention. For example, reg-
ister 0 is addressable at 0x40000000, register 1 is addressable at 0x40000004
and so on. The registers corresponding to the lowest 32 x 4 addresses are the
r/w registers, and the following 32 x 4 addresses are the read-only registers.
To simplify register accesses through memory addresses and given that the
bus matrix uses 32-bit transfers, the two least significant bits of the memory
address are ignored and treated as 0s. In this way every read or write access
will be a full 32-bit register read or write.
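The address mapping described above amounts to a few lines of arithmetic, sketched here in Python. The base address, the register count per set, and the masking of the two lowest address bits come from the text; the function names are hypothetical.

```python
REG_BASE = 0x40000000   # start of the configuration register address space
REGS_PER_SET = 32       # 32 r/w registers followed by 32 read-only registers


def rw_reg_addr(index):
    """Byte address of r/w register `index`."""
    return REG_BASE + 4 * index


def ro_reg_addr(index):
    """Byte address of read-only register `index`."""
    return REG_BASE + 4 * (REGS_PER_SET + index)


def decode(addr):
    """Map a byte address to ('rw' | 'ro', index), or None if out of range."""
    word = ((addr & ~0x3) - REG_BASE) // 4   # drop the two lowest address bits
    if 0 <= word < REGS_PER_SET:
        return ("rw", word)
    if REGS_PER_SET <= word < 2 * REGS_PER_SET:
        return ("ro", word - REGS_PER_SET)
    return None
```

For example, decode(0x40000007) resolves to r/w register 1 because the two lowest address bits are dropped, and ro_reg_addr(9) evaluates to 0x400000A4.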
Recall from section 4.1.1 that the shared bus matrix uses the ARM AMBA 3
AHB protocol. The main purpose of the register wrapper is to adapt a bus
transfer to a read/write operation of a register. There are a few signals from
the master used in a basic AHB transfer, namely HCLK, HADDR[31:0],
HWRITE, HRDATA[31:0](slave) or HWDATA[31:0], and HREADY. HCLK
is the bus clock and it is driven by the same clock that synchronizes the
whole tile. A single transfer consists of two phases: the address phase and
the data phase. The address phase lasts for a single HCLK cycle. The data
phase might require several HCLK cycles and the HREADY signal is used
to control the number of clock cycles required to complete the transfer. The
timing diagrams of a basic read transfer and a basic write transfer are shown
in Figures 5.1 and 5.2 respectively.

Address     Register  Function
0x40000004  r/w[1]    Clock forwarding configuration
0x40000008  r/w[2]    Router input arbiter configuration
0x4000000C  r/w[3]    Local loopback arbiter configuration
0x40000010  r/w[4]    Depacketizer arbiter configuration
0x40000014  r/w[5]    Depacketizer2 arbiter configuration
0x40000018  r/w[6]    xy mesh output arbiter configuration
0x4000001C  r/w[7]    yx mesh output arbiter configuration
0x40000020  r/w[8]    Depacketizer CAS type configuration
0x40000024  r/w[9]    Tile ID being addressed
0x40000028  r/w[10]   PLL configuration
0x4000002C  r/w[11]   Router buffer configuration
0x40000030  r/w[12]   Router stubbing configuration
0x40000034  r/w[13]   Slow clock forwarding configuration
0x4000003C  r/w[14]   Async router configuration
0x40000040  r/w[15]   JTAG loopback configuration
0x40000044  r/w[16]   Tile ID of the current tile
0x4000004C  r/w[17]   Clock phase shifter & duty cycle corrector configuration
0x40000050  r/w[18]   coreresetn release counter
0x40000054  r/w[19]   presetn release counter
0x40000080  r[0]      Queue occupancy stats
0x40000084  r[1]      xy mesh input message counter upper 32 bits
0x40000088  r[2]      xy mesh input message counter lower 32 bits
0x4000008C  r[3]      xy mesh output message counter upper 32 bits
0x40000090  r[4]      xy mesh output message counter lower 32 bits
0x40000094  r[5]      yx mesh input message counter upper 32 bits
0x40000098  r[6]      yx mesh input message counter lower 32 bits
0x4000009C  r[7]      yx mesh output message counter upper 32 bits
0x400000A0  r[8]      yx mesh output message counter lower 32 bits
0x400000A4  r[9]      pll locked

Figure 5.1: Timing diagram of a read transfer
Figure 5.2: Timing diagram of a write transfer
There are two sets of signals that are useful but not shown in the timing
diagrams above, namely HSEL and HTRANS[1:0]. When both HSEL and
HTRANS[1] are high, the address phase is valid. The two signals along with
HREADY start the address phases.
The registers’ interface is very simple and standard. It consists of four
(sets of) input signals, clk, rst n, data in[31:0], and write, and one set of
output signals, data out[31:0]. The registers use single-cycle read/write
based on the one-bit write signal.
Based on the two sides of a read/write operation, an adaptor can be
developed. The translation of the AHB transfer uses an ahb to sram module
from the ARM development kit as a reference. When detecting the address
phase, the wrapper will do a few things. First, it will take in the memory
address and determine whether the address belongs to an r/w register, belongs
to a read-only register, or falls outside the register range. Second, the
wrapper takes in HWRITE, which indicates a read if the signal is 0 or a write
if the signal is 1. Some combinations of the memory address and read/write
request are not valid. These include a read or write request to an
out-of-range memory address and a write request to a read-only register. The
wrapper will return an error response to the shared bus matrix if any of
those combinations occurs. If a valid combination occurs, i.e., a read or
write request to an r/w register or a read request to a read-only register,
the wrapper will determine the register corresponding to the given memory
address, and then either take in HWDATA and write it to the register, if it
is a write request to an r/w register, or read the register and put the
32-bit read-out value on HRDATA.
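The decision logic of the wrapper can be summarized with a behavioral Python model. This is a sketch of the rules stated above, not the RTL: it ignores the two-phase timing of AHB, and the class name, the method name, and the use of the strings "OKAY" and "ERROR" for the bus response are illustrative assumptions.

```python
class RegisterWrapperModel:
    """Behavioral model of the configuration register wrapper (no bus timing)."""

    BASE = 0x40000000
    REGS_PER_SET = 32

    def __init__(self):
        self.rw = [0] * self.REGS_PER_SET  # r/w registers, all 0s after reset
        self.ro = [0] * self.REGS_PER_SET  # read-only registers, driven by hardware

    def transfer(self, haddr, hwrite, hwdata=0):
        """Return (response, read_data) for a single transfer."""
        index = ((haddr & ~0x3) - self.BASE) // 4
        if 0 <= index < self.REGS_PER_SET:
            if hwrite:
                self.rw[index] = hwdata & 0xFFFFFFFF
                return ("OKAY", None)
            return ("OKAY", self.rw[index])
        in_ro_range = self.REGS_PER_SET <= index < 2 * self.REGS_PER_SET
        if in_ro_range and not hwrite:
            return ("OKAY", self.ro[index - self.REGS_PER_SET])
        # Out-of-range address, or a write to a read-only register.
        return ("ERROR", None)


wrapper = RegisterWrapperModel()
write_resp = wrapper.transfer(0x4000003C, hwrite=True, hwdata=0x00000001)
read_resp = wrapper.transfer(0x4000003C, hwrite=False)
bad_write = wrapper.transfer(0x40000080, hwrite=True, hwdata=5)
```

Writing to r/w register 15 and reading it back succeeds, while the write to the first read-only register is rejected with an error response.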
5.2 Clock Distribution
The bypassing channels were designed to effectively broadcast some signals,
like the TCK and TMS signals for JTAG. A clock signal is another signal
that needs to reach all the tiles. In our case, a crystal oscillator on the wafer
will be used as the clock source for the whole wafer. However, the clock
signal cannot simply use the on-die bypassing channels since the delays would
be high enough to cause a significant phase shift of the clocks between tiles
and the clock signal could die out within a distance of a few tiles.
The idea to mitigate and potentially solve the problem of providing clock
signals to all the tiles on the wafer with only one crystal oscillator is for
each tile to produce a stable clock signal based on some clock signals coming
from the neighboring tiles, and forwarding the newly produced stable clock
signal to some neighboring tiles. To do this, there are two clock ports, one
for input and one for output, on each of the four sides, namely, north, south,
east, and west. These ports allow for an arbitrary scheme of forwarding. For
example, if the crystal oscillator is located at the top-left/north-west corner
of the wafer, then a forwarding scheme where each tile takes in clock signals
from top/north and left/west sides and forwards clocks to bottom/south and
right/east side can be used so that every tile has a stable clock.
The clocking infrastructure of a tile mainly consists of a PLL, a frequency
divider, and a clock forwarding module. The PLL is an IP from Analog Bits
that takes in a signal as the clock source and produces a stable clock signal
based on some configuration bits, which come from an r/w register. The
frequency divider is another IP and will divide the frequency of a clock
signal based on some configuration. The clock forwarding module was developed
by the author. It has four clock inputs, a reset input, a few inputs for
configuration, five clock outputs, and one locked status output. The
configuration bits consist of a 4-bit receive bit vector to indicate which
clock(s) from the four directions will be used as inputs, a 4-bit forward bit
vector to indicate to which of the four directions the clock will be
forwarded, and a 32-bit counter value. The five clock outputs consist of one
clock for each of the four directions and one for the current tile. The main
function of the clock forwarding module is to count positive clock edges of
the input clocks and set the output clocks to the input clock that reaches
the counter value first. When one of the input clocks reaches the counter
value given by the input configuration bits, the output locked status bit
will be set. The clock outputs of the directions selected by the forward bit
vector will be set to the selected input clock, and the clock output for the
current tile will always be set to the selected input clock as well. The
4-bit receive bit vector and the 4-bit forward bit vector should be mutually
exclusive since it does not make sense to receive from and then forward to
the same direction. However, once the clock forwarding module is locked to an
input, it ignores the other clock inputs anyway.
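The selection rule of the clock forwarding module can be sketched with a small event-based Python model. The sketch makes several simplifying assumptions: each input clock is represented by a sorted list of rising-edge times, the receive and forward bit vectors are written as sets of direction names, and a tie between two inputs is resolved by direction order, which the text does not specify. All names are hypothetical.

```python
DIRECTIONS = ("north", "south", "east", "west")


def select_clock(edge_times, receive, forward, count):
    """Pick the enabled input whose `count`-th rising edge arrives first.

    edge_times: dict mapping a direction to its sorted rising-edge times.
    receive, forward: sets of directions (stand-ins for the 4-bit vectors).
    """
    lock_time = {
        d: edge_times[d][count - 1]
        for d in DIRECTIONS
        if d in receive and len(edge_times.get(d, [])) >= count
    }
    if not lock_time:
        return {"locked": False, "source": None, "outputs": {}}
    source = min(lock_time, key=lock_time.get)
    outputs = {d: source for d in forward}  # forwarded directions
    outputs["tile"] = source                # the tile always gets the selected clock
    return {"locked": True, "source": source, "outputs": outputs}


result = select_clock(
    edge_times={"north": [5, 15, 25, 35], "west": [2, 12, 22, 32]},
    receive={"north", "west"},
    forward={"south", "east"},
    count=3,
)
```

In this example the west input reaches its third rising edge at time 22, before the north input does at time 25, so the module locks to west and forwards that clock to the south and east outputs as well as to the tile itself.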
Next, the connections and organization of the modules are discussed in
light of how the clock frequencies should work in the real system. Consider
the tile that will take in the crystal oscillator’s frequency and forward some
sort of clock signal to the neighboring tile. The PLL must take the crystal
oscillator’s output as the source. In general, the PLL will be configured to
have an output frequency to input frequency ratio of 30. For example, if
the input clock frequency is 8 MHz, then the output clock frequency will
be 240 MHz. The stable output clock of the PLL, which will be called the
PLL clock from now on, should go through the frequency divider to have
a slower frequency clock ready for the neighboring tiles. In this case, the
clock forwarding unit is not used, and the slow output clock of the frequency
divider will drive the output clock ports of the four directions. As for the tiles
other than the tile connected to the crystal oscillator, some of their input
clock ports will have slow clocks forwarded by the neighboring tiles. Since
there may be multiple input clocks, the clock forwarding module will be used
to pick a clock from only one direction as the clock source. Then that locked
clock source will be fed to the PLL to generate a faster clock as the tile’s
clock. To forward clocks to the neighboring tiles, either the output directly
from the clock forwarding module or the output of the frequency divider can
be used. The output of the frequency divider might be preferred since that
signal has passed through a PLL and a frequency divider, which means it
should be more stable than the output of the clock forwarding module, which
has already traveled a long distance.
However, the infrastructure is designed to have more options available
through configuration. For example, the tiles can be configured to always
forward the fast output clock and directly use the received clock as the tile’s
clock. Many of these schemes could lead to problems similar to those discussed
at the beginning of this section; thus, the slow clock forwarding scheme
should be used.
5.3 Bootup Sequence
This section first introduces some important reset signals, then proposes a
boot-up sequence combining all the considerations discussed before.
There are three reset signals for a tile, namely power-on reset, or poreset
for short, core/system reset, and clock forwarding reset. All three resets are
active-low resets, although the clock forwarding reset is slightly different.
The power-on reset is intended to be asserted when a tile is first pow-
ered on. This reset will reset almost every component that utilizes storage
elements. These components include the PLL, the shared bus matrix, all
the cores and their components, all the configuration registers, the adapters
between the shared SRAMs and the shared bus matrix, all the DAP related
components, and all the NoC related components. The only components not
affected by the power-on reset are the clock forwarding module and both
private and shared SRAMs.
Core/system reset is an input to a core, which essentially will reset the
program counter of that core to the default value. It is useful to restart a
core’s execution without resetting other components or to prevent the core
from starting the execution of instructions.
The clock forwarding reset signal is used to reset the clock forwarding
module and start counting clock edges when released. It has more functionality
than a simple active-low reset. Since the clock forwarding module deals with
clock signals, the configuration bits are registered internally instead of
coming directly from the outputs of a configuration register. This is done to
prevent the configuration bits from changing during counting. Additionally,
the reset signal for the clock forwarding module must come from outside
the tile, specifically from the MBED board outside the wafer, and the fewer
signals, and thus the fewer I/O pins, the better. Therefore, when a negative
edge of the clock forwarding reset signal occurs, two things happen. First,
the internal registers latch in the configuration bits from a configuration
register, and second, the counters are reset to 0s. Afterward, when the
clock forwarding reset signal is high, meaning the reset has been released,
the clock forwarding module operates normally to
count clock edges and select one clock to forward based on the configuration.
One important thing to do after the dies are fabricated in a foundry is to
test the dies individually before bonding them onto a wafer. For the
pre-bonding test, the die will be directly connected to a test harness, mainly
consisting of an MBED board. The I/O pads for the pre-bonding tests are
comparatively very large and sacrificial, so it is important to keep their
number as low as possible. During the development stage, when design
changes for pre-bonding tests were first considered, the tile IDs of the
current tile were designed to be hardwired through I/O pads, which led to 10
I/O pads dedicated to a single functionality that is of no use when testing a
single die. The other I/O pads that are needed for pre-bonding tests include
the power-on reset, the core/system reset, and the JTAG signals, which in
total are fewer than 10 I/O pads. The tile IDs are used to gate a few
tile-wide signals so that the software can address a single tile within the
wafer without affecting other tiles. To have fewer large I/O pads dedicated
to pre-bonding tests, one signal is added to bypass tile ID checking, which
means the 10 tile ID I/O pads are no longer needed. This pre-bonding test
signal, or tile test for short, is also useful when dealing with the boot-up
sequence of the whole wafer. To be more specific, without the tile test
signal, both power-on reset and core/system reset are gated by checking the
tile IDs that the JTAG software is trying to address against the current tile
IDs, so that every individual tile can be power-on reset or core/system reset
without any other tile being reset. Since there is only one tile during the
pre-bonding tests, the tile IDs are not needed. Later in the development, it
was decided to have the tile IDs of the current tile be either hardwired or
programmed through a configuration register. Programmable current tile IDs
can make the bonding of die and wafer easier and allow for options if the
bonds of the hardwired tile IDs are defective.
After the wafer first gets power, many things need to be done before it
can start executing a program. These include proper resets of tiles and
their components, proper configurations of the modules that need them, and
program loading since the waferscale system is designed to have its program
and data reside in the SRAMs. Both configurations and program loading
are done through the JTAG infrastructure since JTAG allows access to the
whole memory space across the wafer from outside the wafer.
After the wafer gets power, the initial power-on reset should be asserted to
reset all the components, since upon first powering up, memory elements like
the registers can have random values which can be invalid. The configuration
registers of all the tiles will be reset to 0s upon the power-on reset and they
need to be configured properly for the wafer to work as intended.
First, the default state of the JTAG chain is loopback mode, meaning each
tile has its JTAG chain start from its tile TDI and end with its tile
TDO. In this default state, the MBED board can only see the tile that it
is directly connected to, which will be referred to as tile 0, since there is no
JTAG chain formed between any neighboring tiles. Recall that multiplexer
m5 in Figure 4.2 is used to select between the TDO of the current tile and
the TDO output from the next tile, to connect to the tile’s I/O for TDO.
In the configuration registers, bit 0 of r/w register 15 is used to control
this multiplexer m5, with the default value 0 selecting the TDO of the current
tile, as well as to drive the chain select signal for the next tile. After
JTAG is used to set r/w register 15 at memory address 0x4000003C of tile 0 to
0x00000001, either in broadcast mode or chain mode, the hardware JTAG chain
reaches the first two tiles. Then it is possible to program r/w register 15
of tile 1, the “next tile” of tile 0, and form the JTAG chain that connects 3
tiles. As this process continues, the MBED board and its JTAG software can
reach all the tiles that are wired together. The design choice of having the
JTAG chain of every tile default to loopback, so that the chain must go
through this “unloop” process to reach the whole tile chain, instead of
having the whole chain connected by default, provides some remedy if a tile
in the middle of the chain is faulty. In this case, the tiles before the
faulty tile can still form a chain and be used. If the default were to have
the whole chain connected,
then one faulty tile could lead to the whole chain being broken.
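The “unloop” process reduces to an ordered list of register writes, which the following Python sketch generates. The register address and the value 0x00000001 are taken from the text; the function name and the tuple layout are hypothetical, and the sketch deliberately leaves out how the JTAG software performs each write or detects a faulty tile.

```python
JTAG_LOOPBACK_REG = 0x40000000 + 4 * 15   # r/w register 15, i.e. 0x4000003C


def unloop_writes(target_tiles):
    """Writes, in order, that grow a chain from 1 tile to `target_tiles` tiles.

    Each entry is (tile index, register address, value). Writing tile i
    attaches tile i+1, so the last tile in the chain is left in loopback.
    """
    return [(tile, JTAG_LOOPBACK_REG, 0x00000001)
            for tile in range(target_tiles - 1)]


writes = unloop_writes(3)
```

For a three-tile chain only tiles 0 and 1 are written. If the fourth tile of a longer physical chain were known to be faulty, stopping at a target of three tiles would keep the tiles before it usable.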
Once the JTAG chains are formed, the MBED boards have full access to the
memory space across the wafer, and then all the tiles need to be configured
and all the SRAMs need to be loaded. There are a few configuration registers
that need to be written with valid values other than the default 0s
for the wafer to work properly. These registers include the clock forwarding
and PLL related r/w registers, so that all the tiles across the wafer get a
clock, and the current tile ID registers, so that the NoC routers work. The
private SRAMs need to be loaded with program binaries, and all the shared
SRAMs can be loaded with additional data if needed.
One important thing to notice during the boot-up phase is that there are no
valid clock signals before the relevant configuration registers are configured
and the clock forwarding happens across the whole wafer. Without a clock
signal, nothing on the wafer will work, including configuring those clock
related registers. To address this problem, the TCK signal among the JTAG
signals is proposed to be used as the clocking signal before the proper clocks
are set up. JTAG’s TCK signal is by no means a periodic signal with 50%
duty cycle since it is driven by the JTAG software and is supposed to only
be toggled once at a time when the TAP controller needs to transition states
or data need to be shifted. However, JTAG’s TCK can be toggled manually.
After the normal JTAG operations, the memory accesses, including memory
addresses, read or write requests, and data if needed, are still inside the
DAP if the clock signal for the tile is not toggled. Therefore, JTAG’s TCK
needs to be manually toggled for enough cycles for the memory accesses to
complete. This manual toggling of JTAG’s TCK must be inserted at multiple
places in the JTAG software to ensure the memory accesses can be completed
using only JTAG’s TCK as the clocking signal. Since there are two clocking
signals that can be used to synchronize a tile, a multiplexer is used to select
between JTAG’s TCK and the clock signal coming from the clock forwarding
infrastructure. The tile test signal will also force the tile to run at JTAG’s
TCK.
As for program loading, the MBED board will first wipe all the SRAM
memory space by writing 0s. Then the software will read the compiled
program file, in our case an .elf file, and translate it into
memory write operations.
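The translation step can be illustrated with a short Python sketch that turns one loadable block of bytes into 32-bit memory writes. Extracting the loadable segments from the .elf file is not shown; the function name is hypothetical, and the little-endian byte order, the word-aligned base address, and the zero padding of the last word are assumptions of the sketch.

```python
def segment_to_word_writes(base_addr, data):
    """Convert a byte segment into a list of (address, 32-bit word) writes."""
    if base_addr % 4 != 0:
        raise ValueError("base address must be word aligned")
    padded = data + bytes(-len(data) % 4)   # pad the tail to a full word with 0s
    return [
        (base_addr + offset,
         int.from_bytes(padded[offset:offset + 4], "little"))
        for offset in range(0, len(padded), 4)
    ]


# Five bytes become two word writes; the base address here is arbitrary.
word_writes = segment_to_word_writes(
    0x20000000, bytes([0x78, 0x56, 0x34, 0x12, 0xFF]))
```

Each resulting pair corresponds to one memory write issued through JTAG, in the same way as the 0s written during the initial wipe.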
Now that all the operations of the boot-up after power-on have been
addressed, the proposed sequence of these operations along with the global
signals’ states/levels will be discussed.
Before power-on, all the resets, namely power-on reset, core/system reset
and clock forwarding reset, are held at 1s since they are active-low resets,
all the JTAG signals are held at 0s, and the tile test signal is held at 0.
After power-on, tile test will be set to 1 since the tile IDs are all 0s, and
JTAG’s TCK will be used as the clocking signal. Then core/system reset will be
set to 0 to prevent the cores from executing anything since the SRAMs are
uninitialized. Since one MBED board can be connected to multiple physical
tile chains, the chains are handled one by one: the tiles within a chain
“unloop” to let the JTAG software reach all the tiles of the physically
connected tile chain. Now that the MBED board can reach all the tiles of all
the chains that it is connected to, the following three things can happen in
arbitrary order: programming the tile IDs, configuring the clocking network,
and loading the programs.
An example of tile IDs programming with tile IDs being (x,y) is to have the
tile at the top-left/north-west corner of the wafer to be tile(0,0), then the tile
to its right/east is tile(1,0) and the tile below it or to its south is tile (0,1). If
the wafer has a 32x32 tile array, then the tile at the bottom-right/south-east
corner of the wafer will have its tile IDs as (31,31).
As an example of configuring the clocking network, suppose the crystal
oscillator is at the top-left/north-west corner of the wafer and its output is
fed to tile(0,0), using the tile ID convention above. The PLL of tile(0,0)
uses the output of the crystal oscillator as its input clock source and
outputs a fast clock as the tile clock; the fast clock then goes through the
frequency divider, and the resulting slow clock is forwarded to tile(1,0) on
the right/east side and tile(0,1) on the bottom/south side. Tile(1,0) is
configured to receive only the clock from its left/west as its input clock; it
uses its PLL to generate a fast clock as the tile clock, the fast clock goes
through the frequency divider, and the slow clock is forwarded to tile(2,0)
on the right/east side and tile(1,1) on the bottom/south side. Tile(0,1) is
configured to receive only the clock from its top/north side as its input
clock; it uses its PLL to generate a fast clock as the tile clock, the fast
clock goes through the frequency divider, and the slow clock is forwarded to
tile(1,1) on the right/east side and tile(0,2) on the bottom/south side.
Tile(1,1) is configured to receive both clocks, from its top/north side and
its left/west side, and to select one of them; the selected clock goes
through its PLL and frequency divider and is then forwarded to tile(2,1) on
its right/east side and tile(1,2) on its bottom/south side. In general, all
the tiles with y=0, excluding tile(0,0), receive one clock from their
left/west and forward to their right/east and bottom/south; all the tiles
with x=0, excluding tile(0,0), receive one clock from their top/north side
and forward to their right/east and bottom/south; and all the remaining
tiles receive two clocks, from their top/north side and their left/west
side, and forward to their right/east side and bottom/south side. With this
scheme, the clock starts from the top-left/north-west corner and propagates
to the bottom-right/south-east corner.
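The forwarding rule can be summarized in a short sketch. It follows the
(x,y) convention of the tile ID example, in which x grows to the east and y
grows to the south, so the top row is y=0 and the left column is x=0. The
function names and the string labels are hypothetical and do not correspond
to actual register fields.

```python
def clock_inputs(x, y):
    """Return the clock inputs that tile (x, y) is configured to listen to."""
    if (x, y) == (0, 0):
        return ["crystal"]        # corner tile is fed by the crystal oscillator
    if y == 0:
        return ["west"]           # top row: the only upstream neighbour is to the west
    if x == 0:
        return ["north"]          # left column: the only upstream neighbour is to the north
    return ["north", "west"]      # elsewhere: two candidates, the tile selects one


def clock_outputs(x, y, width, height):
    """Return the neighbours that receive this tile's divided (slow) clock."""
    outputs = []
    if x + 1 < width:
        outputs.append((x + 1, y))    # right/east neighbour
    if y + 1 < height:
        outputs.append((x, y + 1))    # bottom/south neighbour
    return outputs


def forwarding_hops(x, y):
    """Number of tile-to-tile hops between tile (0, 0) and tile (x, y).

    Every hop goes either east or south, so any forwarding path has x + y hops.
    """
    return x + y


print(clock_inputs(1, 1))              # ['north', 'west']
print(clock_outputs(0, 0, 32, 32))     # [(1, 0), (0, 1)]
print(forwarding_hops(31, 31))         # 62
```

Under these assumptions, the clock of a 32x32 array reaches the far corner
after 62 forwarding hops, which is why the PLL lock check is deferred until
the clock has had time to propagate.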
After all the clocking-related r/w registers are configured, the clock
forwarding reset can be switched from the initial 1 to 0, allowing the
internal registers of the clock forwarding module to latch in the
configuration bits and resetting all the internal counters to 0 at the same
time. Then the clock forwarding reset can be driven from 0 back to 1 to start
the forwarding process. After enough time has passed, read-only register 9
of all the tiles can be polled to check whether their PLLs are locked and
generating stable clock signals for their tiles.
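The reset pulse and the subsequent polling can be sketched as follows. The
FakeWafer class is only a stand-in for the hardware, and all of its method
names are hypothetical; in particular, the way a tile reports lock after a
fixed number of reads is an assumption made so that the example can run.

```python
class FakeWafer:
    """Toy model of the wafer; names and behaviour are hypothetical."""

    def __init__(self, lock_after):
        self.lock_after = list(lock_after)   # reads needed before each tile reports lock
        self.num_tiles = len(self.lock_after)
        self.reset_level = 1                 # active-low reset, initially de-asserted
        self.forwarding = False
        self.reads = [0] * self.num_tiles

    def set_clock_forwarding_reset(self, level):
        if self.reset_level == 0 and level == 1:
            self.forwarding = True           # a 0 -> 1 transition starts forwarding
        self.reset_level = level

    def read_pll_locked(self, tile):
        """Model of reading the PLL lock status from read-only register 9."""
        if not self.forwarding:
            return False
        self.reads[tile] += 1
        return self.reads[tile] >= self.lock_after[tile]


def start_clock_forwarding(wafer, max_polls=100):
    """Pulse the clock forwarding reset, then poll until every PLL reports lock."""
    wafer.set_clock_forwarding_reset(0)      # latch configuration bits, clear counters
    wafer.set_clock_forwarding_reset(1)      # release the reset: forwarding starts
    unlocked = list(range(wafer.num_tiles))
    for attempt in range(1, max_polls + 1):
        unlocked = [t for t in range(wafer.num_tiles)
                    if not wafer.read_pll_locked(t)]
        if not unlocked:
            return attempt                   # number of polling rounds that were needed
    raise TimeoutError(f"PLLs not locked on tiles {unlocked}")


print(start_clock_forwarding(FakeWafer([1, 2, 3])))   # 3
```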
For program loading, if the program is written in a SIMD-like style, i.e.,
all the cores have the same instructions but execute differently based on
their core IDs and data, then the private SRAMs can be loaded using
broadcast mode inside every tile. The length of the JTAG chain in broadcast
mode equals the number of tiles, instead of the number of tiles times 14.
This can reduce the time needed to load the private SRAMs by 14x. Broadcast
mode can be used for loading the shared SRAMs as well, even though each
memory location in the shared SRAMs is then written 14 times. This tradeoff
is worthwhile because the JTAG software is written in C++ and runs on the
MBED board, so the programming speed is relatively slow. Additionally, the
14 repeated writes to the shared SRAMs are handled by the shared bus matrix
and do not block the JTAG software. One important point about private SRAM
loading is that the private SRAMs are only accessible when the core/system
reset is not asserted, i.e., at 1. The core/system reset must therefore be
released and the DHCSR (debug halting control and status register) of every
core written to halt the cores before proceeding to program the private
SRAMs. However, many cycles elapse between releasing the core/system reset
and completing the writes to DHCSR; during that time the program counter
advances and garbage values left in the private SRAMs after power-on are
loaded into the core. Another core/system reset must therefore be asserted
afterwards to make sure the core executes the correct program.
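The chain-length argument can be checked with a few lines of arithmetic.
The 32x32 array is the illustrative size used in the tile ID example, and
the function name is hypothetical.

```python
CORES_PER_TILE = 14


def chain_length(num_tiles, broadcast):
    """Number of JTAG devices in series as seen by the debugger."""
    return num_tiles if broadcast else num_tiles * CORES_PER_TILE


tiles = 32 * 32
normal = chain_length(tiles, broadcast=False)
bcast = chain_length(tiles, broadcast=True)
print(normal, bcast, normal // bcast)   # 14336 1024 14
```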
To shorten the boot-up time, the sequence after “unlooping” the tiles could
be the following: first configure the clocking network without checking PLL
lock, pulse the clock forwarding reset, i.e., switch it from 1 to 0 and back
to 1, program the tile IDs of all the tiles, and then check the PLL locks.
The PLL lock check should be very fast since the clock propagates while the
MBED board is programming the tile IDs. The tile test signal can then be
set to 0 to let every tile run on its PLL clock, which is much faster than
JTAG’s TCK. Next, program loading can be done by releasing the core/system
reset, halting all the cores, and writing to the private SRAMs, and to the
shared SRAMs if needed. After the above sequence is done, a wafer-wide
core/system reset is needed to let all the tiles of the wafer restart at
roughly the same time. However, program functionality should not rely on all the
tiles starting simultaneously since the reset signal propagating through the





Along with the RTL development, multiple RTL simulation strategies were
implemented. The RTL simulation efforts include writing scripts for multiple
simulators and developing testbenches. Simulations were carried out at
multiple stages of the chip development cycle, including functional
simulation, post-synthesis simulation with multiple delay modes, and
post-place-and-route simulation with multiple delay modes. Multiple FPGA
prototypes were also developed at various stages, for example, prototypes of
the system with a single tile containing a single core, a single tile
containing multiple cores, and multiple tiles, each containing multiple
cores. All the simulations and FPGA prototypes were used to verify the test
infrastructure design and the overall system.
6.1 RTL Simulation
Before any simulation can run, testbenches need to be developed to drive the
device under test (DUT). Initially, when there was no debugging
infrastructure or clock forwarding, the testbench was very simple. It only
drove a clock, instantiated a “mesh” consisting of X by Y tiles, and toggled
a global reset signal at the beginning. Memory content was preloaded
through a binary file. The tile IDs were hard-wired during instantiation.
The core/system reset was the same signal as the power-on reset, so there
was only one reset. At that time, the design was tested essentially solely by
the test program preloaded into the SRAMs.
With the development of the debugging infrastructure, the testbench needed
to drive the JTAG protocol. Therefore, most of the functions in the JTAG
software written in C++ were rewritten in SystemVerilog so that the
functionality of the software is available in a SystemVerilog testbench.
Each ported C++ function was rewritten as a SystemVerilog task. In C++,
consecutive statements are executed sequentially and each takes some real
time, but in a Verilog or SystemVerilog testbench, consecutive statements
all take effect at the same simulation time unless explicit delays are
inserted. Tasks were therefore used because, unlike SystemVerilog functions,
they can contain delays. When the MBED executes the JTAG software written in
C++, the delay between statements, or more importantly, the delay between
changes of the signals driven by the software, may not be deterministic. In
SystemVerilog, however, delays are fixed multiples of the time unit. The
SystemVerilog version of the JTAG software may therefore not be an exact
replica of the C++ version, which will be used with the final wafer, but it
is sufficient for simulation purposes.
The SystemVerilog testbenches drive all the signals external to the wafer,
including an input clock, the resets, and JTAG protocol signals. During
a typical simulation run, the testbench will execute the boot-up sequence
discussed above, and then let some programs run. Then the testbench can
halt the programs and read out values from memory locations to check the
correctness of program execution.
Since the main purpose of the simulations is to verify that the design
works, the testbench can exercise a lot of tests that may or may not be
exercised by a typical run. For example, all the JTAG tasks have built-in self-
checking. Multiple versions of the clock forwarding scheme are included. Full
program loading is usually not executed due to the long simulation time it
will incur. Therefore, a relatively short memory read/write test was written.
The memory test will do at least one read and one write, with correctness
checking, to a memory location that belongs to the private SRAMs or the
shared SRAMs where the program loading should happen, thus confirming
the ability to do program loading.
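A minimal sketch of such a read/write check is given below. The actual test
is a SystemVerilog task driving JTAG; here the memory is modeled by a Python
dictionary, and the function name, the test patterns, and the address are
all hypothetical.

```python
def memory_rw_test(write_word, read_word, address,
                   patterns=(0xA5A5A5A5, 0x5A5A5A5A)):
    """Write each pattern to the address, read it back, and compare."""
    for pattern in patterns:
        write_word(address, pattern)
        if read_word(address) != pattern:
            return False          # read-back mismatch: the access path is broken
    return True


# A dictionary stands in for an SRAM reached through the debug access path.
sram = {}
print(memory_rw_test(sram.__setitem__, sram.__getitem__, 0x100))   # True

# A memory whose reads always return 0 fails the check.
print(memory_rw_test(lambda a, v: None, lambda a: 0, 0x100))       # False
```

Using a pattern and its complement means every bit of the word is observed
at both 0 and 1, which a single pattern would not guarantee.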
The testbench can also directly generate network traffic across the mesh
by writing to special memory locations that belong to the packetizer. This
allows the NoC components to be tested without any program loaded into
the private SRAMs of the cores.
There are multiple software programs written in C and C++ that can be
loaded into the SRAMs to help test the overall functionality of the
waferscale system. For example, a traffic generator program can generate
several traffic patterns across the two meshes. Specifically, it can generate
read, write, or CAS requests over the xy and/or the yx mesh, with a
configurable number of in-flight messages, with patterns like random uniform
and hotspot (all the tiles make requests to a single tile). Furthermore, a
basic graph application program that implements a breadth-first search (BFS)
helps test the waferscale system in a more general manner. The BFS program
uses a vertex-centric programming paradigm. One core on the whole wafer acts
as a checker or aggregator, and the rest of the cores are workers. A work
queue that contains the vertices pending to be processed resides in the
shared SRAMs of the tile where the checker/aggregator is, and the graph data
is distributed across the shared SRAMs of the wafer. The work queue is
updated atomically through CAS requests by all the cores. At the end, the BFS
program outputs the number of vertices visited, which is used to check the
correct execution of the program.
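The interaction between the work queue and the CAS requests can be
illustrated with a simplified, single-threaded emulation. It is not the BFS
program itself: the SharedWord class, the queue layout, and the round-robin
scheduling of workers are assumptions made for illustration only.

```python
class SharedWord:
    """One shared-memory word that supports compare-and-swap (CAS)."""

    def __init__(self, value=0):
        self.value = value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


def bfs_visit_count(adjacency, source, num_workers=4):
    """Count the vertices reachable from source using a CAS-updated work queue."""
    visited = [SharedWord(0) for _ in adjacency]
    queue = [None] * len(adjacency)      # each vertex is enqueued at most once
    head, tail = SharedWord(0), SharedWord(0)

    def push(vertex):
        while True:
            slot = tail.value
            if tail.cas(slot, slot + 1):  # reserve the slot atomically
                queue[slot] = vertex
                return

    def pop():
        while True:
            slot = head.value
            if slot >= tail.value:
                return None               # the queue is empty
            if head.cas(slot, slot + 1):  # claim the slot atomically
                return queue[slot]

    visited[source].cas(0, 1)
    push(source)
    progress = True
    while progress:
        progress = False
        for _ in range(num_workers):      # workers take turns in this emulation
            vertex = pop()
            if vertex is None:
                continue
            progress = True
            for neighbour in adjacency[vertex]:
                if visited[neighbour].cas(0, 1):   # only one worker can claim it
                    push(neighbour)
    return sum(word.value for word in visited)


# Vertex 4 has no incoming edge, so it is never reached from vertex 0.
graph = [[1, 2], [3], [3], [], [0]]
print(bfs_visit_count(graph, source=0))   # 4
```

As in the BFS program described above, the final visited count is the value
that would be compared against the expected result.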
Since this is a tape-out project, after the RTL development, the design
must go through synthesis and implementation to conform to a foundry’s
process. Simulations are done after every step. The following section will
discuss the differences between the simulations of the steps.
For RTL functional simulation, essentially all signal changes happen at the
clock edges. For example, one clock signal, without any PLL or buffering, can
perfectly synchronize all the components no matter how large the tile mesh
is. This is the baseline for verification as it only verifies that the RTL
functions as intended.
In general, both post-synthesis simulation and post-implementation
simulation are gate-level simulations, but they do have some differences.
The synthesis tool takes in the RTL design of a chip and tries to recreate
the design using only the cells provided by the foundry’s standard cell
library. The standard cell library contains simple gates such as an
inverter, a two-input NAND, a 2-to-1 multiplexer, and a flip-flop. One
important point is that some IPs do not go through the synthesis tool
because they are either analog circuits or need to be implemented by means
other than the standard cell library. In this design, the PLL and the SRAMs
are the IPs that belong to that category. In post-synthesis simulations,
there can be multiple delay modes. The closest to RTL functional simulation
is zero delay mode, which still has all the signal changes at clock edges.
The main difference is that in RTL functional simulation a signal passes
through one RTL module at a time, whereas in post-synthesis simulation each
original RTL module has been replaced by multiple standard cells.
Additionally, with the optimization steps done by the synthesis tool, unused
signals or parts of a module can be optimized out. The second delay mode
manually adds a fixed delay to all flip-flops. For example, all the
flip-flops can have a fixed tcq (the delay from the clock edge to the output
change) of 50ps. The last delay mode uses standard delay format (SDF)
annotation, which encompasses both interconnect or wire delays and cell or
gate delays. An SDF file lists the rise and fall delays from one input to
one output of a cell or interconnect under three corners: fast, typical, and
slow. At the beginning of the file, it also states the operating conditions
of the three corners, which typically include voltage and temperature. The
fast corner usually corresponds to the highest voltage and the lowest
temperature, while the slow corner usually has the lowest voltage and the
highest temperature.
The implementation step involves a huge amount of physical design effort,
including but not limited to floor planning, layout, clock tree design and
implementation, power delivery system design and implementation, and I/O
pin design and implementation. Essentially, the implementation step takes in
the post-synthesis netlist, adds the necessary designs, and turns the project
into a format, usually GDSII, that a foundry can directly use to tape out the
silicon chips. For post-implementation simulations, the netlists can be even
larger than the post-synthesis ones due to the additional necessary designs,
and the delay information in the SDF file is much more accurate. There can
still be three delay modes, the same as in post-synthesis simulation, namely
zero delay mode, fixed delay mode, and SDF annotation. In this case, SDF
annotation is the most important one. It helps verify the maximum frequency
determined by static timing analysis (STA) and find timing violations.
Compared to FPGA prototyping or emulation, simulation has the advantages of
allowing signal probing, signal tracing, and X-propagation, and it can
handle a fairly large system, limited mainly by RAM capacity, although it
runs relatively slowly. Most of the simulations were done using the Cadence
Incisive simulator, which is single-threaded as far as simulation tasks are
concerned. On a server with an AMD EPYC 7451 24-core processor at 2.3 GHz
and 256 GB of RAM, the post-implementation simulation of a 3-by-3 tile mesh,
each tile with the full 14 cores, takes around 10 hours to finish a boot-up
sequence without any additional tests such as the memory read and write
tests, and another 2 hours for a traffic generator program to reach
equilibrium.
6.2 FPGA Prototype
Another verification method used alongside simulation is FPGA prototyping,
which allows the design to run on actual silicon and communicate with the
MBED board running the JTAG software. A tiny portion of the final wafer was
turned into a Xilinx Vivado project targeting the Zedboard, a development
board for the Xilinx Zynq-7000 All Programmable SoC. The JTAG software runs
on the ARM MBED LPC1768 board, which is connected to the Zedboard via GPIOs
on the MBED board side and a PMOD connector, used as GPIO pins, on the
Zedboard side. An example of the setup is shown in Figure 6.1. To make the
project work on a Xilinx FPGA, two parts of the RTL need to be modified,
namely the PLL and the SRAMs. The PLL is replaced with a Xilinx clocking IP,
which serves a similar purpose but cannot be configured after instantiation.
Due to device limitations, only one such Xilinx clocking IP can be
instantiated for the whole design. The SRAMs are replaced with equivalent
Xilinx BRAM IPs, with smaller capacities due to device limitations.
Specifically, the 64KB private SRAM of each core is replaced by an 8KB BRAM,
and each of the 128KB shared SRAM banks is replaced by a 16KB BRAM. The
Zedboard can only fit a 2-by-2 tile mesh with 1 core in each tile or a
2-by-1 tile mesh with 2 cores in each tile. The boot-up sequence is slightly
different because there can only be one PLL for the whole project. The tiles
can either directly use the PLL output clock without any clock forwarding,
for simplicity, or do some simple clock forwarding that bypasses the PLL and
frequency divider. Full program loading can be done through JTAG as well
because of the smaller sizes of the SRAMs. Using a development board also
brings a few benefits. For example, multiple LEDs can be used for quick
debugging, and multiple switches and buttons can be used for quick resets.
FPGA prototyping is well suited to verifying the testing infrastructure
because the hardware setup is closer to the intended hardware setup with
MBED boards and a wafer populated with chiplets. The I/O pins of the Xilinx
Vivado project are mapped to the corresponding physical locations; for
example, the input clock from the crystal oscillator is mapped to the
crystal oscillator on the Zedboard, and all the external I/O pins, including
the external resets and the JTAG signals, are mapped to the PMOD connectors
on the Zedboard, which are configured to be used as GPIO. The PMOD connector
is connected to the GPIO pins of the MBED board through wires. The MBED runs
the JTAG software and is connected to a computer through a serial port.
Through the computer, the MBED can display debugging information, gather
user inputs for some steps, such as the size of the mesh and the number of
cores currently on the FPGA, and display program execution results by
reading from specific memory locations.
Figure 6.1: An example FPGA-MBED setup
Even though the FPGA prototype holds only a tiny portion of the system, i.e.,
a 2-by-1 tile mesh with 2 cores on each tile, compared to the full-fledged
waferscale system, an estimated 22-by-22 tile mesh with 14 cores on each
tile, or to the size typically used in RTL simulation, a 4-by-4 tile mesh
with 14 cores on each tile, it can still verify the essential functions of
the system while running on silicon. It can represent the behavior of JTAG
interactions, such as a daisy chain with more than one device, more
accurately than simulation. It can also
test the functionality of the cores and NoC since there can be more than one
core and more than one tile. It can also verify that a simple clock forwarding




With the breakdown of Dennard scaling, increasing the number of cores in a
system quickly became the main way to improve computational performance,
especially as parallel applications like machine learning, big data
processing, and cloud computing gain popularity. Novel architectures and
integration technologies support this trend of increasing core count.
Increasing the number of cores by simply increasing the size of a die is not
cost-effective beyond a certain die area because manufacturing defects lead
to a low yield for large dies. Chiplet-based design is the current trend for
building a large system that is cost-effective to manufacture. One important
aspect of a chiplet-based system is the testing of the die(s) at various
stages, including pre-assembly die testing, post-assembly die testing, and
overall system testing.
This thesis presented a testing infrastructure for a chiplet-based
waferscale system. This testing infrastructure expands upon the basic
testing components of a single-core system to support multiple dies, each
with multiple cores, with a minimal I/O pin requirement for each die. It
enables an external debugger to initiate memory accesses to an arbitrary
number of cores across an arbitrary set of chains. The discussion of the
testing infrastructure covered both the hardware organization and some
details of the software used by the external debugger to access the system.
Furthermore, this thesis addressed several challenges in building the
waferscale system and proposed solutions to said challenges. The proposed
testing infrastructure can also apply to other novel integration
technologies that use multiple dies, for example, 3D-stacked ICs and even
monolithic 3D ICs.
There are a few improvements that could be made to the work reported in
this thesis. For example, faster program loading software could be
developed, specifically for the scenario where different cores need
different binaries. This can be done by modifying the loadELF function, which
currently takes in a .elf format file and translates it into a binary to be
loaded into the private SRAM of a core. When loading programs, the software
currently loads one program at a time. This introduces overhead because the
cores that need other programs receive nothing yet still take up JTAG clock
cycles due to chaining (BYPASS registers are 1-bit wide). The overhead can be
avoided by changing the loadELF function to take in all the .elf format
files and construct a two-dimensional matrix with core IDs on the x-axis and
memory locations on the y-axis, whose content is the binaries of all the
cores. Program loading then shifts in the words of all the cores for one
memory location at a time. The more distinct binaries there are, the more
this improves the program loading speed. Furthermore, to verify the test
infrastructure for multiple tiles, an FPGA prototype could ideally use
multiple FPGA development boards, connected through wires, instead of
fitting the system into one board.
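The proposed matrix can be sketched as follows. The function names are
hypothetical, the binaries are represented as plain lists of words rather
than parsed .elf files, and counting one full pass over the JTAG chain per
memory location is a simplification that ignores the exact number of bits
shifted in each pass.

```python
def build_load_matrix(binaries):
    """Arrange per-core binaries as matrix[location][core_id].

    binaries[core_id] is the list of words for that core. None marks a core
    that has nothing to write at a given memory location.
    """
    depth = max((len(words) for words in binaries), default=0)
    return [
        [words[loc] if loc < len(words) else None for words in binaries]
        for loc in range(depth)
    ]


def passes_one_program_at_a_time(binaries):
    """Chain passes when each distinct binary is loaded in its own round."""
    distinct = {tuple(words) for words in binaries}
    return sum(len(words) for words in distinct)


def passes_with_matrix(binaries):
    """Chain passes when every core receives its own word in the same pass."""
    return len(build_load_matrix(binaries))


# Three cores: cores 0 and 2 share one binary, core 1 runs a shorter one.
binaries = [[0x10, 0x11, 0x12], [0x20, 0x21], [0x10, 0x11, 0x12]]
matrix = build_load_matrix(binaries)
print(matrix[0])                                # [16, 32, 16]
print(matrix[2])                                # [18, None, 18]
print(passes_one_program_at_a_time(binaries))   # 5
print(passes_with_matrix(binaries))             # 3
```

Under this simplified count, the matrix approach needs as many passes as the
longest binary has words, regardless of how many distinct binaries exist.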
The test infrastructure presented in this thesis is designed to use as few
I/O pads as possible. The JTAG data signals are daisy-chained, with some
simple circuitry for broadcasting, which leads to a few compromises. For
example, in broadcast mode, a memory read only returns the value from one
core inside a tile, and the other 13 read responses cannot get back to the
debugger. Serialization/de-serialization circuits inside a tile could be
explored to keep the number of I/O pads minimal while enabling a more
parallel approach to accessing all 14 cores by not chaining the JTAG data signals
of those cores. If the number of I/O pads is less of a constraint, which can
be achieved by using advanced connectors or serialization/de-serialization
circuits at the edge or outside a waferscale system, the DfT organization




[1] J. S. Kilby, “The integrated circuit’s early history,” Proceedings of the
IEEE, vol. 88, no. 1, pp. 109–111, 2000.
[2] G. E. Moore, “Cramming more components onto integrated circuits,”
Electronics, vol. 38, no. 8, 1965.
[3] G. E. Moore, “Progress in digital integrated electronics,” in Electron
Devices Meeting, vol. 21, 1975, pp. 11–13.
[4] R. H. Dennard, F. H. Gaensslen, H. Yu, V. L. Rideout, E. Bassous, and
A. R. LeBlanc, “Design of ion-implanted MOSFET’s with very small
physical dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5,
pp. 256–268, 1974.
[5] M. Bohr, “A 30 year retrospective on Dennard’s MOSFET scaling pa-
per,” IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp.
11–13, 2007.
[6] B. A. Nayfeh and K. Olukotun, “A single-chip multiprocessor,” Com-
puter, vol. 30, no. 9, pp. 79–85, 1997.
[7] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer,
B. Sano, S. Smith, R. Stets, and B. Verghese, “Piranha: A scalable
architecture based on single-chip multiprocessing,” in Proceedings of
27th International Symposium on Computer Architecture (IEEE Cat.
No.RS00201), 2000, pp. 282–293.
[8] “Power 4: The first multi-core,
1GHz processor,” IBM. [Online]. Available:
https://www.ibm.com/ibm/history/ibm100/us/en/icons/power4/
[9] N. P. Kronenberg, H. M. Levy, and W. D. Strecker, “VAXclus-
ter: A closely-coupled distributed system,” ACM Trans. Comput.
Syst., vol. 4, no. 2, p. 130–146, May 1986. [Online]. Available:
https://doi.org/10.1145/214419.214421
[10] “About Blue Waters,” National Center for Supercomputing Applica-
tions at University of Illinois at Urbana-Champaign. [Online]. Available:
http://www.ncsa.illinois.edu/enabling/bluewaters
55
[11] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenk-
ins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Ja-
cob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow,
M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss,
T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. V. D. Wijngaart, and
T. Mattson, “A 48-core IA-32 message-passing processor with DVFS in
45nm CMOS,” in 2010 IEEE International Solid-State Circuits Confer-
ence - (ISSCC), 2010, pp. 108–109.
[12] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa,
A. Jaleel, C.-J. Wu, and D. Nellans, “MCM-GPU: Multi-Chip-Module
GPUs for continued performance scalability,” in Proceedings of the 44th
Annual International Symposium on Computer Architecture, ser. ISCA
’17. New York, NY, USA: Association for Computing Machinery,
2017. [Online]. Available: https://doi.org/10.1145/3079856.3080231 p.
320–332.
[13] J. F. McDonald, E. H. Rogers, K. Rose, and A. J. Steckl, “The trials of
wafer-scale integration: Although major technical problems have been
overcome since WSI was first tried in the 1960s, commercial companies
can’t yet make it fly,” IEEE Spectrum, vol. 21, no. 10, pp. 32–39, 1984.
[14] G. H. Loh, “3D-Stacked memory architectures for multi-core proces-
sors,” in 2008 International Symposium on Computer Architecture, 2008,
pp. 453–464.
[15] J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM architec-
ture increases density and performance,” in 2012 Symposium on VLSI
Technology (VLSIT), 2012, pp. 87–88.
[16] D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y. Kim, Y. J. Park, J. H.
Kim, D. S. Kim, H. B. Park, J. W. Shin, J. H. Cho, K. H. Kwon, M. J.
Kim, J. Lee, K. W. Park, B. Chung, and S. Hong, “25.2 A 1.2V 8Gb 8-
channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with
effective microbump I/O test methods using 29nm process and TSV,”
in 2014 IEEE International Solid-State Circuits Conference Digest of
Technical Papers (ISSCC), 2014, pp. 432–433.
[17] V. Kumar and A. Naeemi, “An overview of 3D integrated circuits,” in
2017 IEEE MTT-S International Conference on Numerical Electromag-
netic and Multiphysics Modeling and Optimization for RF, Microwave,
and Terahertz Applications (NEMO), 2017, pp. 311–313.
[18] Y. Chuang, C. Yuan, J. Chen, C. Chen, C. Yang, W. Changchien,
C. C. C. Liu, and F. Lee, “Unified methodology for heterogeneous inte-
gration with CoWoS technology,” in 2013 IEEE 63rd Electronic Com-
ponents and Technology Conference, 2013, pp. 852–859.
56
[19] S. Y. Hou, W. C. Chen, C. Hu, C. Chiu, K. C. Ting, T. S. Lin, W. H.
Wei, W. C. Chiou, V. J. C. Lin, V. C. Y. Chang, C. T. Wang, C. H. Wu,
and D. Yu, “Wafer-level integration of an advanced logic-memory system
through the second-generation CoWoS technology,” IEEE Transactions
on Electron Devices, vol. 64, no. 10, pp. 4071–4077, 2017.
[20] R. Mahajan, R. Sankman, N. Patel, D. Kim, K. Aygun, Z. Qian,
Y. Mekonnen, I. Salama, S. Sharan, D. Iyengar, and D. Mallik, “Em-
bedded multi-die interconnect bridge (EMIB) – a high density, high
bandwidth packaging interconnect,” in 2016 IEEE 66th Electronic Com-
ponents and Technology Conference (ECTC), 2016, pp. 557–565.
[21] S. K. Moore, “Huge chip smashes deep learning’s speed barrier,” IEEE
Spectrum, vol. 57, no. 1, pp. 24–27, 2020.
[22] S. Jangam, S. Pal, A. Bajwa, S. Pamarti, P. Gupta, and S. S. Iyer,
“Latency, bandwidth and power benefits of the SuperCHIPS integration
scheme,” in 2017 IEEE 67th Electronic Components and Technology
Conference (ECTC), 2017, pp. 86–94.
[23] S. Pal, D. Petrisko, A. A. Bajwa, P. Gupta, S. S. Iyer, and R. Kumar, “A
case for packageless processors,” in 2018 IEEE International Symposium
on High Performance Computer Architecture (HPCA), 2018, pp. 466–
479.
[24] S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar,
“Architecting waferscale processors - A GPU case study,” in 2019 IEEE
International Symposium on High Performance Computer Architecture
(HPCA), 2019, pp. 250–263.
[25] R. Arnold, S. M. Menon, B. Brackett, and R. Richmond, “Test methods
used to produce highly reliable known good die (KGD),” in Proceedings.
1998 International Conference on Multichip Modules and High Density
Packaging (Cat. No.98EX154), 1998, pp. 374–382.
[26] H. D. Thacker, M. S. Bakir, D. C. Keezer, K. P. Martin, and J. D.
Meindl, “Compliant probe substrates for testing high pin-count chip
scale packages,” in 52nd Electronic Components and Technology Con-
ference 2002. (Cat. No.02CH37345), 2002, pp. 1188–1193.
[27] M. Bushnell and V. Agrawal, Essentials of Electronic Testing for Digital,
Memory and Mixed-signal VLSI Circuits. Springer Science & Business
Media, 2004, vol. 17.
[28] S. Mitra, E. J. McCluskey, and S. Makar, “Design for testability and
testing of IEEE 1149.1 TAP controller,” in Proceedings 20th IEEE VLSI
Test Symposium (VTS 2002), 2002, pp. 247–252.
57
[29] “Arm debug interface architecture specification,” April 2018, Arm
Limited, 110 Fulbourn Road, Cambridge, England CB1 9NJ. [Online].
Available: https://developer.arm.com/documentation/ihi0031/latest/
[30] IEEE Standard for Test Access Port and Boundary-Scan Architecture,
IEEE Std 1149.1-2013, 2013.
[31] “ARM Cortex-M3 processor technical reference man-
ual,” November 2016, Arm Limited. [Online]. Available:
https://developer.arm.com/documentation/100165/0201
[32] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi,
L. Vega, C. Zhao, R. Zhao, S. Dai, A. Amarnath, B. Veluri, P. Gao,
A. Rao, G. Liu, R. K. Gupta, Z. Zhang, R. Dreslinski, C. Batten, and
M. B. Taylor, “The Celerity open-source 511-core RISC-V tiered acceler-
ator fabric: Fast architectures and design methodologies for fast chips,”
IEEE Micro, vol. 38, no. 2, pp. 30–41, 2018.
[33] E. J. Marinissen, T. McLaurin, and Hailong Jiao, “IEEE Std P1838:
DfT standard-under-development for 2.5D-, 3D-, and 5.5D-SICs,” in
2016 21th IEEE European Test Symposium (ETS), 2016, pp. 1–10.
58
