University of Pennsylvania

ScholarlyCommons
Publicly Accessible Penn Dissertations
2022

Virtualizing Reconfigurable Architectures: From FPGAs To Beyond
Yue Zha
University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/edissertations
Part of the Computer Engineering Commons

Recommended Citation
Zha, Yue, "Virtualizing Reconfigurable Architectures: From FPGAs To Beyond" (2022). Publicly Accessible
Penn Dissertations. 5418.
https://repository.upenn.edu/edissertations/5418

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/5418
For more information, please contact repository@pobox.upenn.edu.

Virtualizing Reconfigurable Architectures: From FPGAs To Beyond
Abstract
With field-programmable gate arrays (FPGAs) being widely deployed in data centers to enhance
computing performance, efficient virtualization support is required to fully unleash the potential of
cloud FPGAs. However, system support for FPGAs in the cloud environment is still in its
infancy, which leads to low resource utilization due to the tight coupling between compilation and
resource allocation. Moreover, the system support proposed in existing works is limited to a
homogeneous FPGA cluster comprising identical FPGA devices and is hard to extend to a
heterogeneous FPGA cluster that comprises multiple types of FPGAs. As the FPGA cloud is expected to
become increasingly heterogeneous due to the rolling hardware upgrade strategy, it is necessary to
provide efficient virtualization support for the heterogeneous FPGA cluster.
In this dissertation, we first identify three pairs of conflicting requirements from runtime management and
offline compilation, which are related to the tradeoff between flexibility and efficiency. These conflicting
requirements are the fundamental reason why the single-level abstraction proposed in prior works for the
homogeneous FPGA cluster cannot be trivially extended to the heterogeneous cluster. To decouple these
conflicting requirements, we provide a two-level system abstraction. Specifically, the high-level
abstraction is FPGA-agnostic and provides a simple and homogeneous view of the FPGA resources to
simplify runtime management and maximize flexibility. In contrast, the low-level abstraction is
FPGA-specific and exposes sufficient low-level hardware details to the compilation framework to ensure
the mapping quality and maximize efficiency. This generic two-level system abstraction can also be
specialized to the homogeneous FPGA cluster and/or be extended to leverage application-specific
information to further improve efficiency. We also develop a compilation framework and a modular
runtime system with a heuristic-based runtime management policy to support this two-level system
abstraction. By enabling dynamic FPGA sharing at the sub-FPGA granularity, the proposed virtualization
solution can deploy 1.62× more applications using the same amount of FPGA resources and reduce the
compilation time by 22.6% (when performing as many compilation tasks in parallel as possible) with an
acceptable virtualization overhead, i.e., < 10% degradation in a single application's performance.
Finally, we use Liquid Silicon as a case study to show that the proposed virtualization solution can be
extended to other spatial reconfigurable architectures. Liquid Silicon is a homogeneous reconfigurable
architecture enabled by the non-volatile memory technology (i.e., RRAM). It extends the configuration
capability of existing FPGAs from computation to the whole spectrum ranging from computation to data
storage. It allows users to better customize hardware by flexibly partitioning hardware resources between
computation and memory based on the actual usage. Instead of naively applying the proposed
virtualization solution onto Liquid Silicon, we co-optimize the system abstraction and Liquid Silicon
architecture to improve the performance.

Degree Type
Dissertation

Degree Name
Doctor of Philosophy (PhD)

Graduate Group
Electrical & Systems Engineering

First Advisor
Jing Li

Keywords
Cloud computing, FPGA, RRAM, Virtualization

Subject Categories
Computer Engineering


VIRTUALIZING RECONFIGURABLE ARCHITECTURES:
FROM FPGAS TO BEYOND

Yue Zha

A DISSERTATION
in
Electrical and Systems Engineering
Presented to the Faculties of the University of Pennsylvania
in
Partial Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
2022

Supervisor of Dissertation
Jing Li, Associate Professor
Electrical and Systems Engineering

Graduate Group Chairperson
Alejandro Ribeiro, Professor
Electrical and Systems Engineering

Committee:
André DeHon, Professor of Electrical and Systems Engineering
University of Pennsylvania
Zhiru Zhang, Associate Professor of Electrical and Computer Engineering
Cornell University
Jing Li, Associate Professor of Electrical and Systems Engineering
University of Pennsylvania

VIRTUALIZING RECONFIGURABLE ARCHITECTURES:
FROM FPGAS TO BEYOND

COPYRIGHT

2022

Yue Zha

ACKNOWLEDGMENT
First and foremost, I would like to thank my advisor, Prof. Jing Li. I have been extremely
fortunate to join her group and have the opportunity to work with her. I have learned a
lot from her over the last several years. Her expertise, insight, and seemingly unlimited
support have guided this work in the best of ways. Thank you!
I also want to thank the members of my defense committee: Prof. André DeHon and
Prof. Zhiru Zhang. They provided valuable feedback and comments that greatly improved
the quality of this dissertation.
I am also grateful to my colleagues in the PennCIL group for the cherished time spent
together in the lab. In particular, I would like to thank Jialiang Zhang and Nick Beckwith,
who have taught me many important techniques and provided much support in both research
and life.
I would also like to thank my parents for their invaluable support, encouragement, and
unwavering belief in me. Without you, I would not be the person I am today.
Finally, I would like to thank my wife Mingxi for her love and for all the late nights
and early mornings. This work would not have come to a successful end without her constant
support.


ABSTRACT
VIRTUALIZING RECONFIGURABLE ARCHITECTURES:
FROM FPGAS TO BEYOND
Yue Zha
Jing Li
With field-programmable gate arrays (FPGAs) being widely deployed in data centers
to enhance computing performance, efficient virtualization support is required to
fully unleash the potential of cloud FPGAs. However, system support for FPGAs in
the cloud environment is still in its infancy, which leads to low resource
utilization due to the tight coupling between compilation and resource allocation. Moreover,
the system support proposed in existing works is limited to a homogeneous FPGA cluster
comprising identical FPGA devices and is hard to extend to a heterogeneous FPGA
cluster that comprises multiple types of FPGAs. As the FPGA cloud is expected to become
increasingly heterogeneous due to the rolling hardware upgrade strategy, it is necessary to
provide efficient virtualization support for the heterogeneous FPGA cluster.
In this dissertation, we first identify three pairs of conflicting requirements from runtime
management and offline compilation, which are related to the tradeoff between flexibility
and efficiency. These conflicting requirements are the fundamental reason why the single-level
abstraction proposed in prior works for the homogeneous FPGA cluster cannot be
trivially extended to the heterogeneous cluster. To decouple these conflicting requirements,
we provide a two-level system abstraction. Specifically, the high-level abstraction is
FPGA-agnostic and provides a simple and homogeneous view of the FPGA resources to simplify
runtime management and maximize flexibility. In contrast, the low-level abstraction
is FPGA-specific and exposes sufficient low-level hardware details to the compilation
framework to ensure the mapping quality and maximize efficiency. This generic two-level
system abstraction can also be specialized to the homogeneous FPGA cluster and/or be
extended to leverage application-specific information to further improve efficiency. We
also develop a compilation framework and a modular runtime system with a heuristic-based
runtime management policy to support this two-level system abstraction. By enabling
dynamic FPGA sharing at the sub-FPGA granularity, the proposed virtualization solution
can deploy 1.62× more applications using the same amount of FPGA resources and reduce
the compilation time by 22.6% (when performing as many compilation tasks in parallel as
possible) with an acceptable virtualization overhead, i.e., < 10% degradation in a single
application’s performance.
Finally, we use Liquid Silicon as a case study to show that the proposed virtualization
solution can be extended to other spatial reconfigurable architectures. Liquid Silicon is a
homogeneous reconfigurable architecture enabled by the non-volatile memory technology
(i.e., RRAM). It extends the configuration capability of existing FPGAs from computation
to the whole spectrum ranging from computation to data storage. It allows users to better
customize hardware by flexibly partitioning hardware resources between computation and
memory based on the actual usage. Instead of naively applying the proposed virtualization
solution onto Liquid Silicon, we co-optimize the system abstraction and Liquid Silicon
architecture to improve the performance.


TABLE OF CONTENTS

ACKNOWLEDGMENT . . . . . iii
ABSTRACT . . . . . iv
LIST OF TABLES . . . . . x
LIST OF ILLUSTRATIONS . . . . . xi

1 Introduction . . . . . 1
  1.1 Thesis . . . . . 1
  1.2 Motivation . . . . . 1
  1.3 Target Service Model . . . . . 4
  1.4 Prior Virtualization Solutions and Our Goals . . . . . 5
  1.5 Overview and Contributions . . . . . 8

2 FPGA Background . . . . . 11
  2.1 FPGA Architecture . . . . . 11
  2.2 FPGA Compilation Flow . . . . . 14
  2.3 Partial Reconfiguration . . . . . 15
  2.4 FPGA Integration . . . . . 16
  2.5 Cloud Instance Characterization . . . . . 18

3 System Abstraction for Cloud FPGAs . . . . . 19
  3.1 Design Requirements . . . . . 19
  3.2 Two-Level System Abstraction . . . . . 24
    3.2.1 FPGA Overlay . . . . . 27
    3.2.2 Virtual-to-Physical Mapping . . . . . 29
    3.2.3 Design Space Exploration . . . . . 30
  3.3 Specialized to a Homogeneous Cluster . . . . . 33
  3.4 Case Study: Extend to Support Application-Specific ISA . . . . . 36
  3.5 Results . . . . . 40
    3.5.1 Two-Level System Abstraction . . . . . 40
    3.5.2 Single-Level System Abstraction . . . . . 44
    3.5.3 Creating Multiple Types of Physical Blocks . . . . . 46
    3.5.4 Discussion . . . . . 48

4 Compilation Framework . . . . . 51
  4.1 Compilation Framework for Two-Level Abstraction . . . . . 51
    4.1.1 Recursive Partition Process . . . . . 57
  4.2 Compilation Framework for Single-Level Abstraction . . . . . 59
  4.3 Compilation Framework for Application-Specific ISA . . . . . 62
    4.3.1 Decomposing Step . . . . . 62
    4.3.2 Partition Step . . . . . 67
  4.4 Results . . . . . 69
    4.4.1 Compilation Time . . . . . 70
    4.4.2 Compilation Quality . . . . . 74
    4.4.3 Case Study: AS ISA-based Accelerator . . . . . 79

5 Scheduling and Resource Management . . . . . 84
  5.1 Modular Runtime System . . . . . 84
    5.1.1 Specialized for A Homogeneous FPGA Cluster . . . . . 86
  5.2 Task Scheduling Policy . . . . . 86
  5.3 Resource Allocation Policy . . . . . 87
    5.3.1 Possible Extension . . . . . 89
  5.4 Results . . . . . 89
    5.4.1 Design Space Exploration on Parameter N and K . . . . . 90
    5.4.2 Improvement Over Non-virtualized Environment . . . . . 92
    5.4.3 Comparison between Variants of Two-Level System Abstraction . . . . . 93

6 Extend to Liquid Silicon . . . . . 96
  6.1 Background . . . . . 96
    6.1.1 RRAM and Access Device . . . . . 96
    6.1.2 Related Work . . . . . 98
  6.2 Liquid Silicon Architecture . . . . . 99
    6.2.1 Overview . . . . . 99
    6.2.2 Configuration Modes . . . . . 102
    6.2.3 Comparison With FPGAs . . . . . 107
    6.2.4 Circuit Implementation . . . . . 108
  6.3 Custom Compilation Framework . . . . . 117
    6.3.1 Adaptive Resource Partition . . . . . 120
  6.4 Chip Demonstration . . . . . 121
    6.4.1 Operational Modes . . . . . 123
    6.4.2 Write Operation . . . . . 131
    6.4.3 Discussion . . . . . 133
  6.5 Extend Virtualization Solution . . . . . 134
  6.6 Results . . . . . 136
    6.6.1 Evaluation Setup . . . . . 136
    6.6.2 Traditional FPGA Benchmarks . . . . . 140
    6.6.3 Search-intensive Applications . . . . . 142
    6.6.4 Neural Network Benchmarks . . . . . 143
    6.6.5 Chip Results . . . . . 144
    6.6.6 Virtualization Evaluation . . . . . 149

7 Conclusion . . . . . 152
  7.1 Limitation and Possible Future Works . . . . . 153

BIBLIOGRAPHY . . . . . 156

LIST OF TABLES
1.1 A comparison of prior virtualization support for FPGAs and our design
goals. . . . . . 7
3.1 Resources provided by one physical block and the maximum communication
bandwidth provided by the intra-die and inter-die interconnections. . . . . . 41
3.2 The amount of resources exposed to users. . . . . . 50
4.1 The resource usages of evaluated benchmarks. . . . . . 71
4.2 Hardware implementation results of the two baseline accelerators. . . . . . 80
4.3 The latency of LSTM/GRU inference tasks. . . . . . 82
6.1 Description of the search-intensive benchmark set. . . . . . 138
6.2 Topology for BNN benchmarks. . . . . . 139
6.3 Liquid Silicon Chip Specification . . . . . 146
6.4 Comparison with state-of-the-art AI chips . . . . . 147
6.5 Comparison with nv-FPGA . . . . . 149

LIST OF ILLUSTRATIONS
1.1 (a) A conceptual diagram to illustrate the management method used
in existing FPGA clouds. It only supports a static resource allocation
due to the lack of an abstraction, leading to an inefficient resource
utilization. (b) Several works [13][21] (including the low-latency mode of
AmorphOS [67]) abstract FPGAs into a pool of slots to enable a dynamic
resource allocation, thereby improving the resource utilization.
However, this improvement could be limited due to the internal
fragmentation issue. Moreover, users need to manually partition
applications and handle the inter-slot communication if they cannot fit into
one slot. (c) The high-throughput mode of AmorphOS enables FPGA
sharing by combining multiple applications during the compilation process.
However, this method does not decouple compilation and resource
allocation, nor does it make full use of FPGA resources due to the
lack of multi-FPGA support. . . . . . 2
1.2 (a) The amount of resource used by several representative FPGA
applications (C-LSTM [129], DeltaRNN [43], BNN 1 [163], BNN 2 [83],
FPGP [31], GraVF [37], GraphOps [100], ForeGraph [32]). The results
are normalized to the capacity of Xilinx VU13P FPGA. (b) The FPGA
capacity keeps growing due to technology advances. . . . . . 3
1.3 A conceptual diagram illustrates the service models defined for cloud
FPGAs . . . . . 5
2.1 A conceptual diagram of the island-style FPGA architecture. The key
building blocks of one CLB in Xilinx UltraScale FPGA are drawn as an
example. Note that the commercial-grade FPGA architecture introduces
additional features that are not drawn in this diagram for simplicity. . . . . . 12
2.2 A conceptual diagram to illustrate the clock distribution network in the
prior FPGAs (left) and the UltraScale FPGA (right). . . . . . 13
2.3 A conceptual diagram illustrates the additional architectural features to
support the multi-die package. The routing fabric and hard IP blocks
are not drawn for simplicity. . . . . . 14
2.4 (a) A typical FPGA compilation flow that comprises a front-end and
a back-end. (b) A conceptual diagram to illustrate the tight coupling
between compilation results and resource allocation. Specifically,
different spatial resource constraints of the allocated FPGA resource lead to
distinct compilation results. . . . . . 15

2.5 A conceptual diagram to illustrate the constraints when creating
multiple partial reconfigurable regions in Xilinx FPGAs. Specifically, the
partial reconfigurable region #1 and #2 can co-exist on one FPGA device,
while region #1 and #3 cannot be created on the same FPGA
device. The reconfigurable resources are not drawn for simplicity. . . . . . 17
2.6 Conceptual diagrams illustrate the popular integration methods for
FPGAs, which are (a) tightly integrated with CPU in the same package
or on the same board, (b) connected to CPU through PCIe, or (c)
directly attached to the datacenter network. (d) Commercial FPGA
clouds [40][4] typically use a hybrid method, which is also the
integration method targeted in this dissertation. . . . . . 18
3.1 A conceptual diagram illustrates the basic structure of the system
abstraction and the mapping process. . . . . . 20
3.2 A conceptual diagram illustrates the conflicting requirements on the
system abstraction. Specifically, a homogeneous system abstraction (top)
provides portability across different heterogeneous FPGA clusters, but
has non-negligible resource waste due to the mismatched spatial resource
constraints. On the contrary, a heterogeneous abstraction with
specialized virtual blocks can achieve a high resource utilization at the
cost of no portability. . . . . . 22
3.3 A conceptual diagram to illustrate the difference between an
asynchronous interface (top) and a synchronous interface (bottom).
Specifically, the asynchronous interface enables a dynamic runtime deployment
at the cost of additional buffers and control logic. On the contrary, the
synchronous interface can be efficiently implemented but only supports
a static deployment that is determined at offline compile time. . . . . . 23
3.4 A conceptual diagram illustrates the two-level system abstraction for
a heterogeneous FPGA cluster. (a) The high-level abstraction comprises
a pool of high-level virtual blocks (HL virtual blocks) that are
connected by an all-to-all network. An asynchronous interface is provided
for the inter-block communication. One HL virtual block has
no spatial resource constraint to hide the heterogeneity across FPGAs.
(b) The low-level abstraction comprises multiple arrays of low-level
virtual blocks (LL virtual blocks), where one array abstracts one type of
FPGA. One LL virtual block contains a certain amount of reconfigurable
resources that are organized in pre-defined spatial resource constraints.
A synchronous interface is provided for the intra-array communication,
while an asynchronous interface is also provided to implement the
asynchronous interface in the high-level abstraction. . . . . . 26
3.5 The physical FPGA is divided into Service Region and User Region to
support the two-level system abstraction. The virtualization support for
on-board DRAM is drawn as an example. Note that the actual layout
of these regions is tailored to a specific type of FPGA. . . . . . 28

3.6 A conceptual diagram illustrates the virtual-to-physical mapping, where
one HL virtual block is offline mapped into an array of LL virtual blocks
and then deployed into one FPGA of the corresponding type at runtime.
Multiple mapping results are offline generated for one HL virtual block
so that it can be deployed into different types of FPGAs at runtime. . . . . . 28
3.7 A conceptual diagram illustrates the benefits of the constrained
mapping strategy (right). Specifically, this strategy ensures that the
inter-block timing does not change under the dynamic runtime deployment.
Moreover, this strategy also simplifies the interconnection network
between physical blocks. On the contrary, the unconstrained mapping
strategy (left) has a varying inter-block timing under different runtime
deployments. Moreover, a dedicated interconnection needs to be
provided for each pair of physical blocks to ensure that the inter-block
connections are not shared between applications, leading to a complex
interconnection network. . . . . . 31
3.8 (a) Multiple LL virtual block arrays with different combinations of the
synchronous interfaces are provided for one type of FPGA to account
for the difference between the inter-die and intra-die communication
latency. (b) One LL virtual block array could be shared among a set
of FPGAs if these FPGAs reuse the same die design. This effectively
amortizes the compilation cost. The service region in physical FPGAs
is not drawn for simplicity. . . . . . 32
3.9 A conceptual diagram illustrates that smaller physical blocks can be
created if multiple types of LL virtual blocks are provided for one type
of physical FPGAs (ignoring the heterogeneity caused by the multi-die
package). . . . . . 33
3.10 (a) A conceptual diagram illustrates the single-level system abstraction
specialized for the homogeneous FPGA cluster, which comprises a 1D
array of identical virtual blocks. (b) This single-level system abstraction
minimizes the compilation cost as one compilation result can be used
for different runtime deployments. Only physical blocks are drawn for
simplicity. (c) The single-level system abstraction and two-level system
abstraction effectively complement each other. Although they require
different FPGA overlays, they can co-exist in a homogeneous FPGA
cluster enabled by the FPGA’s programmability. . . . . . 35

3.11 A Communication Region is included to implement the
latency-insensitive interface. (a) As the width of the physical block is larger
than its height, placing the communication region between two physical
blocks reduces the interconnection length and supports more
interconnections compared with placing the communication region on the left/right
side of physical blocks. (b) The communication region needs to be created
as partial reconfigurable regions to support various latency-insensitive
interfaces. This substantially reduces the number of physical blocks
provided by one FPGA due to the constraint in creating partial
reconfigurable regions. (c) Thus, we only create communication regions that
support inter-FPGA communications for the physical blocks on the top
and bottom. The communication regions in the middle only support
intra-FPGA communication and are created as static regions. Note
that the actual layout of these regions is tailored to a specific type of
FPGA. . . . . . 37
3.12 (a) An application-specific abstraction layer can be added on top of the
two-level system abstraction to support application-specific ISA. This
additional layer comprises a pool of soft blocks. Same as the high-level
virtual blocks, these soft blocks also have variable spatial resource
constraints to simplify the partition process. (b) The soft block has a
multi-level tree structure, where one soft block can have an arbitrary
number of child blocks that are connected either in the data parallelism
or pipeline parallelism. (c) These two primitive parallel patterns are
sufficient to construct other complex patterns, such as the adder tree. . . . . . 38
3.13 A conceptual diagram to illustrate that the extracted parallel patterns
are leveraged to simplify the mapping from the additional abstraction
layer to the high-level abstraction layer. . . . . . 39
3.14 The commercial FPGA XCVU37P from Xilinx is partitioned into
regions to support the two-level system abstraction. User Region that
is indexed with U is exposed to users, while the Service Region that
is indexed with S is reserved by the system. The circuits in the
system-reserved regions are pre-implemented and cannot be modified by users.
The mapping results are obtained from Vivado 2020.1. . . . . . 41
3.15 The commercial FPGA XCKU115 from Xilinx is partitioned into
regions to support the two-level system abstraction. User Region that
is indexed with U is exposed to users, while the Service Region that
is indexed with S is reserved by the system. The circuits in the
system-reserved regions are pre-implemented and cannot be modified by users.
The mapping results are obtained from Vivado 2020.1. . . . . . 43

3.16 The commercial FPGA XCVU37P from Xilinx is partitioned into three
regions to support the single-level system abstraction. User Region
that is indexed with U is exposed to users, while the Service Region
that is indexed with S and the Communication Region that is
indexed with C are reserved by the system. The circuits in the
system-reserved regions are pre-implemented and cannot be modified by users.
The partition pins are only drawn for illustration purposes and are
not in their actual positions. The mapping results are obtained from Vivado
2020.1. . . . . . 45
3.17 The commercial FPGA XCKU115 from Xilinx is partitioned into three
regions to support the single-level system abstraction. User Region
that is indexed with U is exposed to users, while the Service Region
that is indexed with S and the Communication Region that is
indexed with C are reserved by the system. The circuits in the
system-reserved regions are pre-implemented and cannot be modified by users.
The partition pins are only drawn for illustration purposes and are
not in their actual positions. The mapping results are obtained from Vivado
2020.1. . . . . . 46
3.18 Two types of physical blocks are created on XCVU37P FPGA when
implementing the two-level system abstraction. An additional sub-region
S-5 is created to share the DDR4/PCIe interface and the inter-FPGA
interconnection among these smaller physical blocks. The mapping
results are obtained from Vivado 2020.1. . . . . . 47
3.19 Two types of physical blocks are created on XCKU115 FPGA when
implementing the two-level system abstraction. An additional sub-region
S-4 is created to share the DDR4/PCIe interface and the inter-FPGA
interconnection among these smaller physical blocks. The mapping
results are obtained from Vivado 2020.1. . . . . . 49
4.1 The compilation framework for the two-level system abstraction. The
steps using custom tools are highlighted in blue. . . . . . 52
4.2 A conceptual diagram illustrates the latency-insensitive interface
generated for one HL virtual block. . . . . . 53
4.3 A conceptual diagram illustrates the process of mapping one high-level
virtual block onto physical FPGAs. The local routing step is not drawn
in the figure for simplicity. . . . . . 55
4.4 One application is recursively partitioned into multiple HL virtual
blocks. . . . . . 57
4.5 A conceptual diagram to illustrate that improving K only leads to
non-negligible runtime performance improvement in limited scenarios. . . . . . 58
4.6 The compilation framework for the single-level system abstraction. The
steps using custom tools are highlighted in blue. . . . . . 59

4.7 (a) A conceptual diagram to illustrate that the quality of the local
placement step for single-level system abstraction is more sensitive to the
position of partition pins compared with that of the monolithic
placement step in two-level system abstraction. This is mainly because the local
placement step has a smaller placement region (one physical block) and
more partition pins. (b) An iterative partition method is applied to
obtain fine-grained partition results when mapping user logic into virtual
blocks. The fine-grained partition results are leveraged to determine the
position of partition pins. In the drawn example, the partition result
obtained from the third iteration is used to determine the position of
partition pins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.8 When the inter-FPGA connection is implemented for one interface, then
the implementations of the remaining interfaces are all determined based
on the number of physical blocks provided by one FPGA. In the drawn
example, one FPGA provides N physical blocks. Then the control logic
for the intermediate blocks can be merged and implemented in the
inter-FPGA connection. This figure only draws the implementation for one
dataflow (from top to bottom) for simplicity. . . . . . . . . . . . . . . . .
4.9 A conceptual diagram illustrates the decomposing flow, where (a) the
control and data path in one AS ISA-based accelerator design is first
separated into two soft blocks, and the soft block that contains data
path can be decomposed either in (b) a top-down flow or (c) a
bottom-up flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.10 Conceptual diagrams illustrate (a) the step of extracting the data
parallelism within a leaf soft block, (b) the step of identifying inter-block
data parallelism, and (c) the step of identifying pipeline parallelism. .
4.11 (a) A conceptual diagram illustrates the technique of scaling down one
AS ISA-based accelerator. Specifically, one AS ISA-based accelerator
is split into two smaller ones. Each one has a complete control path and
only computes part of the computation results. We provide a template
module for inter-FPGA synchronization (highlighted in blue). (b) The
key building blocks of this synchronization module are drawn in the figure.
4.12 A conceptual diagram illustrates the template architecture used for
generating different variants of accelerator designs. A multi-level
distribution network and pipeline registers are included for better timing. . . .
4.13 The runtime breakdown of different compilation process for the evaluated accelerator designs. For each accelerator design (small, medium
or large), from top to bottom, the runtime of the baseline compilation
flow, the compilation flow for two-level abstraction that has one type of
LL virtual block for one FPGA, the compilation flow for the single-level
abstraction, and the compilation flow for the two-level abstraction that
has two types of LL virtual blocks for one FPGAs are drawn. . . . . .
4.14 The aggregated compilation time of different compilation flows. . . . .

xvi

60

62

64

66

68

70

72
75

4.15 The operating frequency of the accelerators mapped onto the two-level system abstraction, which is normalized to that mapped by the conventional FPGA flow. For each benchmark, the results of three accelerator variants are provided (from left to right: large, medium and small).
4.16 The average operating frequency obtained under different n values, which is normalized to that mapped by the conventional FPGA compilation flow. The average partition time with different n values is also reported. The parameter n is defined in Section 4.2.
4.17 The required communication bandwidth of the inter-block interconnections when mapping applications onto the two-level system abstraction and the single-level system abstraction. Enabled by the two-level system abstraction, the corresponding compilation framework can effectively identify the boundary with the low bandwidth requirement to partition these benchmarks. On the contrary, due to the unified asynchronous interface in the single-level abstraction, the corresponding compilation framework cannot find such a boundary.
4.18 A conceptual diagram illustrates the organization of the AS ISA-based accelerator design and the decomposing results.
4.19 The floorplanning is leveraged to improve the mapping quality of the baseline accelerator. Part of the floorplanning used for XCVU37P FPGA is shown in (a). This function is also leveraged to improve the mapping quality of one virtual block to ensure a fair comparison. The optimized implementation result is shown in (b).
4.20 The impact of the inter-FPGA communication latency on the inference latency when the AS ISA-based accelerator is deployed onto two FPGA devices.
5.1 (a) The two-level modular runtime management system for the heterogeneous FPGA cluster. The on-demand and spot instances are defined in Section 2.5. (b) The single-level runtime management system for the homogeneous FPGA cluster.
5.2 (a) A conceptual diagram illustrates that an inappropriate resource allocation leads to a resource fragmentation issue. (b) A conceptual diagram illustrates the calculation of the fragmentation score. Service region in FPGAs is not drawn for simplicity.
5.3 A conceptual diagram to illustrate the flow of allocating resources for one application. Only one bottom-level manager is drawn for simplicity.
5.4 The normalized response time for on-demand and spot instances under different N and K. The percentages of on-demand instances and batch workloads are both 50%.
5.5 (a) The comparison of the normalized response time over the non-virtualized environment for the heterogeneous FPGA cluster. (b) The comparison of the normalized response time delivered by the two-level system abstraction and single-level system abstraction for the homogeneous FPGA cluster. The results of the non-virtualized environment are not drawn in (b) for better clarity.
5.6 The normalized response time under different percentages of (a) on-demand instances and (b) batch workloads. The resource contention ratio is 0.9 in both experiments. The percentage of batch workloads is 50% in (a), and the percentage of on-demand instances is 50% in (b).
5.7 The performance comparison between the two different variants of the two-level system abstraction.

6.1 (a) The Ir/Ta2O5−δ/TaOx/TaN structure [132] of one RRAM cell. (b) The resistive switching I-V curve.
6.2 Liquid Silicon provides a user-controlled resource provisioning to cover the whole spectrum, from data-intensive to compute-intensive. On the contrary, FPGAs only provide an efficient support on compute-intensive applications.
6.3 (a) To improve resource utilization, one tile can be partitioned between heavy-weight compute mode and interconnect mode. (b) This flexibility results in better mapping with low routing pressure compared to FPGAs.
6.4 A conceptual diagram illustrates the Liquid Silicon architecture. 2 × 2 tiles are drawn in the example. In one tile, the 1D1R-based crossbar array is stacked atop connection nodes (CMOS circuits) and does not consume die area. The key building blocks of one connection node are also drawn in the figure.
6.5 One tile in the light-weight compute mode supports the parallel search operation. The data entries can be stored either row-wise (left) or column-wise (right). The matched entry is highlighted in blue.
6.6 The light-weight compute mode is also used to implement the binarized neural network, and the data layout in one tile can be either horizontal or vertical.
6.7 The operation of the heavy-weight compute mode is illustrated (left) and four logic functions are packed and mapped onto one tile. The operation of the interconnect mode is illustrated (middle). These two modes can coexist in the same tile (right).
6.8 An example illustrates (a) the read operation and (b) the write operation in the memory mode. This memory block stores four 2-bit words.
6.9 Detailed implementation of one connection node.
6.10 The implementation of the S/A design.
6.11 (a) The voltages on WLs and the RRAM states are presented. The corresponding discharge current for these two BLs are also drawn. (b) The voltages on these two BLs. (c) The voltage on the node SN in the S/A. (d) The output of S/A. (e) The output of the configurable dynamic inverter, and (f) this output is latched by the reference timing signal.
6.12 Circuit design of the configurable dynamic inverter.
6.13 (a) Circuit implementation of the non-volatile configuration memory, and voltage setups for three operations are highlighted. (b) 3D2R cells can be organized in a crossbar structure and the voltage setup to program one RRAM cell (in blue) is illustrated.

6.14 One example illustrates sensing operations when providing (a) one sensing clock or (b) two sensing clocks.
6.15 Physical design of a tile under 40nm technology.
6.16 Workflow of the compilation framework. The back-end is modified to support the features provided in Liquid Silicon.
6.17 (a) This Liquid Silicon test chip comprises a 2D array of identical tiles, and each tile contains a 1T1R memory array and several connection nodes. Note that the adjacent tile is rotated by 90 degrees. The pitch mismatch between WL and BL can be resolved in the connection node through a two-metal transition routing network. (b) The schematic of a 1T1R memory cell, and (c) key building blocks of the connection node are drawn in the figure.
6.18 (a) The read data path for the sensing operation is drawn in the figure. (b) The conceptual diagram illustrates the sensing operation.
6.19 (a, b) An arbitrary AND function is implemented on one row (BL). (c) Multiple functions are implemented in one array with a compact mapping.
6.20 (a) Three adjacent tiles are used to implement a memory block that stores 16 2-bit words. (b) The left tile implements the row address decoder and routes column address to the central tile. (c) The central tile implements the column address decoder and stores data. (d) The bottom tile implements the selection logic to generate the read result.
6.21 The timing diagram of the read operation.
6.22 (a) An optimal selection of the row/column address bits leads to a 32-bit (16 2-bit words) of memory capacity, as compared with (b) a non-optimal one which leads to a 24-bit (12 2-bit words) of memory capacity given the same area. (c) The optimal address selection for a memory block with a 4-bit word size, which achieves 48-bit (12 4-bit words) of memory capacity. The left/bottom tiles are not drawn in this figure for simplicity.
6.23 One tile can be configured to perform parallel search operations.
6.24 One tile can be configured to implement binarized neural networks.
6.25 (a) The write data path for programming a selected 1T1R cell is drawn in the figure. The conceptual diagram illustrates the operation to (b) set the selected RRAM into LRS, and (c) reset the selected RRAM into HRS. (d) The output voltages of HV-drivers are summarized in the table.
6.26 (a) The distributions of the effective resistance for both match and 1-bit mismatch cases. (b) The maximum operating frequency, power consumption, and array efficiency under different array sizes. (c) The power efficiency and area efficiency for machine learning and big data applications under different array sizes.
6.27 A conceptual diagram illustrates the low-level abstraction modified for Liquid Silicon.
6.28 A conceptual diagram illustrates the hybrid routing fabric in the modified Liquid Silicon. The cluster contains 2 × 2 tiles in the example.
6.29 The compilation framework for the virtualized Liquid Silicon.

6.30 From top to bottom are area, delay, energy efficiency (energy-delay product, EDP) and routing usage results. Results of the SRAM-based FPGA are used as baseline, and other results are normalized to them. The routing usage is the ratio between routing area and total used area when mapping benchmark circuits. In Liquid Silicon, it is obtained by first calculating the ratio between routing area and total used area (routing+logic) of each tile and averaging across all tiles.
6.31 The area saving (top), throughput improvement (middle) and power reduction (bottom) are presented. All results are normalized to that of SRAM-based FPGA. The area result is plotted in logarithmic scale.
6.32 The runtime speedup (top), energy consumption (middle) and area (bottom) results are presented. All results are normalized to that of SRAM-based FPGA.
6.33 (a) Die photo and (b) the integration flow [50].
6.34 (a) The measured resistance distribution under the switching condition: Forming 4V@40µs, SET 2V@100ns, RESET 2.5V@100ns, (b) the measured voltage frequency scaling, and (c) the measured waveform for logic ‘1’ output (Computation mode: ‘True’, Storage mode: ‘1’, Search mode: ‘match’, NN mode: ‘active’) and logic ‘0’ output (Computation mode: ‘False’, Storage mode: ‘0’, Search mode: ‘mismatch’, NN mode: ‘inactive’). These measurements are conducted at room temperature.
6.35 The area, delay and EDP (energy-delay product) results of mapping applications onto the virtualized Liquid Silicon architecture, which are normalized to those of the non-virtualized one.
6.36 The delay result (left) and the number of tracks per unit length (right) of mapping applications onto the virtualized Liquid Silicon, which are normalized to that of the FPGA architecture.
6.37 The runtime of the compilation framework developed for the virtualized environment is normalized to that of the framework for the non-virtualized one. Results of executing all compilation tasks sequentially (top) and executing all tasks in parallel (bottom) are presented. Only the key compilation tasks are drawn in the figure for simplicity.

Chapter 1
Introduction
1.1 Thesis

A system abstraction and a compilation framework are developed for the heterogeneous
FPGA cluster in the context of the cloud environment. By enabling dynamic FPGA
sharing in the spatial domain at the sub-FPGA granularity and providing multi-FPGA
support, the proposed virtualization solution can deploy 1.62× more applications using the
same amount of FPGA resources with a marginal degradation in the single application’s
performance (no more than 10%) compared to that in the non-virtualized environment. This
virtualization solution can also be extended to other spatial reconfigurable architectures,
such as Liquid Silicon, an RRAM-based reconfigurable architecture.

1.2 Motivation

Integrating field-programmable gate arrays (FPGAs) into cloud infrastructures to enhance
their computing performance is one important trend in recent years, mainly because of the
high energy efficiency, predictable latency, and the superior flexibility of accelerating diverse
applications, including machine learning [23][117][159], data analysis [82][41][63] and graph
processing [31][32][37]. However, the system support for FPGAs in the context of the cloud
environment is still in its infancy. Consequently, the resource management strategy used in
the embedded computing environment is adopted in existing FPGA clouds (e.g., Amazon
AWS F1 [4]) to manage the pool of FPGA resources at a per-device granularity, i.e., one or
multiple physical FPGA devices are exclusively allocated to one application (Figure 1.1a).
This coarse-grained management strategy leads to an inefficient utilization of the FPGA
resources due to the mismatch between the amount of resources required by applications and
the capacity of FPGAs, i.e., applications cannot fully utilize the allocated FPGA resources
(Figure 1.1a). This mismatch comes from the diversity in both applications and the FPGA
devices. Specifically, the increasingly diverse applications that are deployed in the cloud
require different amounts of resources (Figure 1.2a). The on-demand computing model also
allows users to deploy distinct FPGA accelerators (even for the same application) to account
for the varying demands on performance and cost, which further increases the diversity in
resource usage. Moreover, benefiting from the advanced technology node and packaging
(e.g., multi-die packaging [111]), the capacity of FPGAs keeps growing (Figure 1.2b). On
the one hand, this increases the diversity in FPGA capacity due to the hardware rolling
upgrade. On the other hand, this inevitably leads to a mismatch between the resource usage
of legacy FPGA accelerators and the capacity of the latest FPGAs.

[Figure 1.1: diagram with panels (a), (b) and (c). Labels include Applications, Slots, Manual Partition, Combine, Interconnection, FPGA Cluster, Static resource allocation, Dynamic resource allocation, Offline, Runtime, and Resource Waste (No Abstraction; Internal Fragmentation; No multi-FPGA support).]
Figure 1.1: (a) A conceptual diagram to illustrate the management method used in existing
FPGA clouds. It only supports a static resource allocation due to the lack of an abstraction,
leading to an inefficient resource utilization. (b) Several works [13][21] (including the
low-latency mode of AmorphOS [67]) abstract FPGAs into a pool of slots to enable a dynamic
resource allocation, thereby improving the resource utilization. However, this improvement
could be limited due to the internal fragmentation issue. Moreover, users need to manually
partition applications and handle the inter-slot communication if they cannot fit into one
slot. (c) The high-throughput mode of AmorphOS enables FPGA sharing by combining
multiple applications during the compilation process. However, this method does not
decouple compilation and resource allocation, nor does it make full use of FPGA resources,
due to the lack of multi-FPGA support.

[Figure 1.2: two plots. (a) Normalized Utilization (%) of LUT, DFF, BRAM and DSP resources. (b) Normalized Capacity (logarithmic axis) of Logic Cells, Memory and DSP versus Technology Node (nm), from 90 down to 16.]
Figure 1.2: (a) The amount of resources used by several representative FPGA applications
(C-LSTM [129], DeltaRNN [43], BNN 1 [163], BNN 2 [83], FPGP [31], GraVF [37],
GraphOps [100], ForeGraph [32]). The results are normalized to the capacity of Xilinx
VU13P FPGA. (b) The FPGA capacity keeps growing due to technology advances.

The root cause of the inefficient resource utilization is the lack of virtualization support.
Specifically, due to the lack of an abstraction, the FPGA compilation results are tightly
coupled with the physical resource allocation (Section 2.2) and applications need to be
recompiled in case of any physical resource change (either capacity or location). Unlike the
compilation process for CPUs, the FPGA compilation process has a high time complexity
and may take hours or even days depending on the complexity of applications. Thus,
although runtime recompilation can improve the resource utilization, it incurs a prohibitive
runtime overhead that limits the physical resource allocation to the offline compile time.
The inability to dynamically respond to the actual load and resource availability at runtime
leads to the inefficient utilization of FPGA resources.
This dissertation describes virtualization support for cloud FPGAs that improves the resource utilization with a marginal virtualization overhead. We note that the
virtualization support developed for CPUs cannot be trivially applied to FPGAs due to
the fundamental differences between their architectures and computing models. Specifically,
an FPGA application describes the physical hardware circuits wired together under spatial
resource constraints, while a CPU application is a sequence of pre-defined instructions executing in the temporal domain. Thus, this dissertation presents a new system abstraction
tailored to the spatial reconfigurable architecture. A compilation framework is also provided
to compile applications onto this system abstraction. The existing FPGA commercial tools
are maximally reused in this compilation framework to minimize the development effort
and ensure the compilation quality.

1.3 Target Service Model

Different service models require distinct virtualization support. Thus, it is necessary to
determine the target service model before exploring the virtualization support for FPGAs.
Following the definition of the service model provided by the CPU-based clouds, we broadly
define three service models for the FPGA cloud. The first one is the Infrastructure-as-a-Service
(IaaS) model. Under this service model, only the I/O interface of FPGA devices is
virtualized through a shell, while reconfigurable resources (e.g., lookup tables) are not
virtualized and users can directly manage these physical reconfigurable resources (Figure 1.3).
The virtualization support proposed in [17][70][106][119][162] is developed for this service
model. The second one is the Platform-as-a-Service (PaaS) model. Under this service
model, both the I/O interface and the reconfigurable resources are virtualized through a
system abstraction. Users can only request virtualized resources with no control on the
physical resources (Figure 1.3). The virtualization support proposed in [7][21][67][73][160]
is developed for this service model. The last one is the Software-as-a-Service (SaaS) model.
Under this service model, a set of application-specific accelerators are abstracted into
predefined APIs, which are exposed to users (Figure 1.3). The virtualization support proposed
in [40][54][55][149] is developed for this service model. In this dissertation, we choose to
explore the virtualization support for the PaaS model, mainly because (1) the virtualization
support for the IaaS model is relatively simple and has been well studied in prior works,
and (2) the PaaS model provides a scalable platform for application developers to build
their own FPGA accelerators and can be easily extended to the SaaS model.

[Figure 1.3: three stacks labeled Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). Layers include Data, Application, Resource Management, Compilation, System Abstraction, Shell, FPGAs and Networking. Legend: managed by the virtualization methods; controlled by users.]
Figure 1.3: A conceptual diagram illustrates the service models defined for cloud FPGAs.

1.4 Prior Virtualization Solutions and Our Goals

The existing virtualization solutions for the PaaS model can be broadly categorized into two
groups: time-multiplexing and space-multiplexing. Time-multiplexing methods [33][18][123][9]
[66][120][80] share FPGA resources among multiple applications in the temporal domain
through context switching. These works typically use multi-context FPGAs to hide the
context switching overhead. Different from the commercial FPGA that is a single-context
architecture and can only store one context of configuration, multi-context FPGAs contain
multiple sets of configuration memories to store several contexts of configuration. When
one context of configuration is used for computation, a new context can be simultaneously
loaded to hide the configuration overhead. However, the additional configuration memories significantly reduce the amount of FPGA resources available to users. Thus, multi-context FPGAs (such as Tabula [52]) are less popular than single-context FPGAs, and
time-multiplexing is less attractive than space-multiplexing. Space-multiplexing methods [21][160][73][7][13][38][72][71][125] (including the low-latency mode of AmorphOS [67])
abstract FPGA resources into a pool of slots (slot-based methods) and partition physical
FPGAs into regions, where one region is used to implement one slot (Figure 1.1b). This
method decouples the compilation and resource allocation, thereby enabling dynamic FPGA
sharing at the sub-FPGA granularity (Table 1.1). However, due to the lack of multi-slot
support, i.e., one application needs to be mapped into a single slot, these methods face a
dilemma when determining the capacity of the slot. A larger slot size increases the amount
of wasted resources due to internal resource fragmentation, while a smaller slot size increases
the burden on users, i.e., more applications need to be manually partitioned (Figure 1.1b).
Thus, the improvement in resource utilization obtained from these methods can be limited.
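
The slot-size dilemma can be made concrete with a small numerical sketch. The Python snippet below is only an illustration under a simplified, hypothetical cost model that is not taken from the cited works: an application that does not fit into one slot is assumed to be manually partitioned into as many slots as it needs, and the function name `slot_tradeoff` as well as the resource demands are made up for this example.

```python
import math


def slot_tradeoff(app_sizes, slot_size):
    """Return (wasted_fraction, manual_partitions) for one slot size.

    Assumed model (hypothetical): an application that fits occupies one slot;
    a larger application is manually partitioned by the user and occupies
    ceil(size / slot_size) slots.
    """
    allocated = 0
    manual = 0
    for size in app_sizes:
        slots = math.ceil(size / slot_size)
        if slots > 1:
            manual += 1  # the user has to split this application by hand
        allocated += slots * slot_size
    used = sum(app_sizes)
    return (allocated - used) / allocated, manual


if __name__ == "__main__":
    # Made-up resource demands, e.g., in thousands of LUTs.
    apps = [100, 150, 300, 450]
    for slot in (500, 250, 125):
        waste, manual = slot_tradeoff(apps, slot)
        print(f"slot={slot}: wasted fraction={waste:.2f}, "
              f"manually partitioned apps={manual}")
```

Under these assumptions, shrinking the slot lowers the fraction of allocated resources that is wasted, but raises the number of applications that users must partition by hand, which is the tension described above.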
The virtualization solution provided in AmorphOS [67] achieves better FPGA sharing
than other space-multiplexing methods by providing two operating modes: a low-latency
mode and a high-throughput mode. The low-latency mode applies the slot-based method
and thus also faces the same dilemma. The high-throughput mode wraps multiple applications into a single application, which is then compiled onto a single physical FPGA to enable
fine-grained FPGA sharing (Figure 1.1c). While achieving better resource utilization than
the slot-based methods, this high-throughput mode may still suffer from the resource fragmentation issue due to the lack of multi-FPGA support. Moreover, this high-throughput
mode does not decouple the compilation and resource allocation. Thus, it needs to
generate the compilation results offline for many combinations to support various resource
allocations at runtime. When one application changes, the compilation results of all
combinations related to this application need to be regenerated. Thus, this method has a high
offline compilation overhead.

Table 1.1: A comparison of prior virtualization support for FPGAs and our design goals.

Method                       FPGA Sharing   Multi-FPGA Support   Resource Utilization   Virtualization Overhead
Time-Multiplexing [33]       Support        No                   Medium                 High
Slot-based methods [21]      Support        No                   Medium                 Low
AmorphOS∓ [67]               Support        No                   High                   High
Multi-FPGA Framework [32]    No             Support              Medium                 Low
SCORE [35]                   Support†       Support              Medium                 Low
Our Goal                     Support        Support              High                   Low

∓ High-throughput mode
† Hard to provide sufficient isolation

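
To give a feeling for why compiling combinations offline becomes expensive, the following Python sketch counts combinations under a deliberately simple assumption that is ours and not a description of how AmorphOS selects combinations: any subset of at most `max_per_fpga` applications out of `n_apps` may be co-located on one FPGA, and each such subset needs its own compilation result. The function names are hypothetical.

```python
from math import comb


def combos_total(n_apps, max_per_fpga):
    """Number of non-empty application subsets of size <= max_per_fpga."""
    return sum(comb(n_apps, k) for k in range(1, max_per_fpga + 1))


def combos_with_app(n_apps, max_per_fpga):
    """Number of those subsets that contain one particular application.

    These are the results that would have to be regenerated when that
    application changes (assumed model).
    """
    return sum(comb(n_apps - 1, k - 1) for k in range(1, max_per_fpga + 1))


if __name__ == "__main__":
    n, k = 8, 4  # made-up cluster workload: 8 applications, up to 4 per FPGA
    print("combinations to compile offline:", combos_total(n, k))
    print("recompiled when one application changes:", combos_with_app(n, k))
```

In this counting model, both numbers grow combinatorially with the number of applications, which is consistent with the high offline compilation overhead noted above.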
Several frameworks [104][32][44][23] are developed to deploy one application onto multiple FPGAs. These frameworks are not developed for the PaaS model but fall into the broad
context of FPGA virtualization. Some commercial evaluation platforms (e.g., Cadence Protium S1 [14]) also use multiple FPGAs for emulation. These frameworks/platforms provide
valuable experience in providing multi-FPGA support. However, they do not address the
tight coupling between compilation and resource allocation. Thus, the resource allocation
still needs to be performed at offline compile time, and these frameworks cannot enable
dynamic FPGA sharing among multiple users at sub-FPGA granularity.
SCORE [35][15][16] is an early pioneering work that falls into the broad context of FPGA
virtualization and was developed before cloud computing became ubiquitous. It aims to
reduce the FPGA programming complexity and provide multi-FPGA support by providing a
stream-oriented compute model and a new abstraction. It partitions physical FPGAs into
compute and memory regions that are connected by an all-to-all network to support the
proposed compute model and abstraction. These regions are required to be identical to
decouple the compilation and resource allocation, thereby enabling dynamic FPGA sharing.
While the methodology presented in SCORE is inspiring, it mainly targets the single-user,
single-application environment for embedded computing systems. Thus, it cannot be trivially
applied to cloud FPGAs. For instance, it is hard to provide strong isolation across
applications because of the shared all-to-all interconnection network, which also reduces the
amount of resources that are exposed to users.
As shown in Table 1.1, we have two design goals when developing the virtualization
support for cloud FPGAs. (1) Providing a system abstraction to decouple the compilation
and resource allocation, thereby enabling dynamic FPGA sharing at sub-FPGA granularity without incurring high compilation overhead. (2) Providing multi-FPGA support to
mitigate the resource fragmentation issue caused by the physical FPGA boundary.
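
The effect of the second goal can be sketched with a toy allocator. The Python snippet below is a minimal illustration under assumed conditions, not the resource manager developed in this dissertation: each FPGA is modeled as a number of identical blocks, requests are served in arrival order with a first-fit policy, and the function name `deploy` and the request sizes are hypothetical.

```python
def deploy(requests, num_fpgas, blocks_per_fpga, multi_fpga):
    """Count how many requests (in blocks) can be deployed with first-fit.

    multi_fpga=False: a request must fit into the free blocks of one FPGA.
    multi_fpga=True:  a request may span the boundary between FPGAs.
    """
    free = [blocks_per_fpga] * num_fpgas
    deployed = 0
    for need in requests:
        if multi_fpga:
            if sum(free) < need:
                continue  # not enough free blocks in the whole cluster
            for i in range(num_fpgas):
                take = min(free[i], need)
                free[i] -= take
                need -= take
                if need == 0:
                    break
            deployed += 1
        else:
            for i in range(num_fpgas):
                if free[i] >= need:
                    free[i] -= need
                    deployed += 1
                    break
    return deployed


if __name__ == "__main__":
    reqs = [3, 3, 2]  # made-up requests, in blocks
    print("single-FPGA only:", deploy(reqs, 2, 4, multi_fpga=False))
    print("multi-FPGA      :", deploy(reqs, 2, 4, multi_fpga=True))
```

In this example, two FPGAs with four blocks each end up with one free block apiece after the first two requests. The third request only fits when it is allowed to span both devices, which is the fragmentation caused by the physical FPGA boundary.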

1.5 Overview and Contributions

The rest of the dissertation is organized as follows. Chapter 2 provides the necessary
background information for exploring the FPGA virtualization, including the FPGA architecture and the compilation flow. With this background information, it then explains the
tight coupling between the compilation and resource allocation. This chapter also discusses
the FPGA integration methods and the characteristics of cloud instances.
Chapter 3 describes the system abstraction developed for cloud FPGAs. It starts with
a two-level system abstraction that is developed for the heterogeneous FPGA cluster. This
system abstraction not only decouples the compilation and resource allocation, but also
simultaneously satisfies the conflicting requirements of runtime management and offline
compilation. This two-level abstraction is then merged into a single-level one that is specialized for the homogeneous FPGA cluster. Moreover, this generic two-level system abstraction can also be extended to support the SaaS model. This chapter presents a case
study that extends this abstraction to support application-specific ISA-based accelerators
[40]. The corresponding compilation frameworks that map applications onto the proposed
system abstraction are presented in Chapter 4. These compilation frameworks maximally
reuse existing FPGA compilation tools to minimize the development efforts and ensure
the compilation quality. Custom tools are developed for the new steps that are not supported by the conventional FPGA compilation tool. Chapter 5 presents the scheduling
and resource management policy. A heuristic method is presented to efficiently reduce the
resource fragmentation.
Chapter 6 presents an RRAM-based reconfigurable architecture, namely Liquid Silicon,
and uses it as a case study to show that the proposed virtualization solution can be extended
to other spatial reconfigurable architectures. This chapter first describes the key building
blocks of Liquid Silicon that extend the configuration capability of existing FPGAs from
computation to the whole spectrum ranging from computation to data storage. It then
presents a compilation framework that is developed to fully exploit the unique programmability provided by Liquid Silicon. Finally, the proposed system abstraction for FPGAs and
the Liquid Silicon architecture are co-optimized to apply the proposed virtualization solution onto Liquid Silicon. Chapter 7 highlights the main points of our work and concludes
this dissertation.
In particular, we made the following major contributions in this dissertation:
• A new system abstraction and a compilation framework are developed to virtualize
cloud FPGAs. This virtualization solution improves the overall resource utilization
by enabling a dynamic FPGA sharing at sub-FPGA granularity and reduces the
compilation time. It can also be extended to better support a homogeneous FPGA
cluster and the SaaS model. This part of work is published in [154][156][155].
• A RRAM-based reconfigurable architecture, namely Liquid Silicon, is developed to
address the limitations of existing FPGAs. (1) It provides a flexible resource provisioning between computation and storage that can be controlled by users to better
match applications’ requirements, while FPGAs have a fixed resource provisioning
that is determined by vendors. (2) It supports a coarse-grained logic implementation that effectively reduces the routing overhead. This part of work is published in
[152][153][150][151][158][157].
• We use Liquid Silicon as a case study to show that the proposed virtualization support
can be extended to other spatial reconfigurable architectures.

Chapter 2
FPGA Background
2.1 FPGA Architecture

A simplified view of FPGA architecture that is frequently cited in textbooks or publicly
available tutorials is drawn in Figure 2.1, i.e., an island-style heterogeneous architecture that
comprises a 2D array of configurable logic blocks (CLBs), switch blocks (SBs), connection
blocks (CBs) and hard IP blocks. Specifically, CLBs contain several look-up tables (LUTs)
and each LUT stores a truth table to implement an arbitrary 6-input logic function or two
logic functions with 5 or fewer inputs. The output of LUTs can be optionally connected to
a flip-flop to implement sequential circuits. CLBs also comprise hardened multiplexers and
a carry chain to efficiently implement complex circuits with more than 6 inputs, such as
adders [141]. Besides implementing logic functions, CLBs can also be configured as storage
elements in modern FPGAs. SBs and CBs form an extensive bit-wise network to route the
interconnections between CLBs and hard IP blocks. Hard IP blocks are scarce hardware
resources that are included for augmenting the capability of FPGAs in performing specific
functions. For instance, block RAMs (BRAMs) are used for on-chip data storage and DSPs
are used for arithmetic computations.
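
As a rough illustration of the LUT behaviour described above, the following Python sketch models a 6-input LUT as a 64-entry truth table with two outputs. The function names (`make_truth_table`, `lut6_2`), the bit ordering of the table, and the convention that O5 reads the lower 32 entries while O6 reads all 64 are assumptions made for this sketch; it is not a vendor model.

```python
def make_truth_table(func, num_inputs):
    """Pack func(i0, ..., i_{n-1}) -> 0/1 into an integer truth table."""
    table = 0
    for index in range(1 << num_inputs):
        bits = [(index >> k) & 1 for k in range(num_inputs)]
        if func(*bits):
            table |= 1 << index
    return table


def lut6_2(table, inputs):
    """Evaluate a 64-entry truth table; inputs is (i0, ..., i5), each 0 or 1.

    O6 indexes all 64 entries; O5 ignores i5 and indexes the lower 32 entries
    (an assumed convention for a fracturable LUT).
    """
    index6 = sum(bit << k for k, bit in enumerate(inputs))
    o6 = (table >> index6) & 1
    o5 = (table >> (index6 & 0x1F)) & 1
    return o6, o5


# One arbitrary 6-input function: AND of all six inputs.
and6 = make_truth_table(lambda a, b, c, d, e, f: a & b & c & d & e & f, 6)

# Two 5-input functions sharing inputs: XOR5 in the lower half (seen on O5),
# AND5 in the upper half (seen on O6 when i5 is tied to 1).
xor5 = make_truth_table(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e, 5)
and5 = make_truth_table(lambda a, b, c, d, e: a & b & c & d & e, 5)
dual = (and5 << 32) | xor5
```

With `dual`, tying the sixth input to 1 makes O6 return the AND of the first five inputs while O5 returns their XOR, which is the "two logic functions with 5 or fewer inputs" mode mentioned in the text.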
Inherited from multiple product generations, contemporary commercial FPGAs have
more complex architectural features that are not included in the simplified view. It is necessary to understand these additional features when developing the virtualization support.
In this chapter, we use the widely used UltraScale FPGA from Xilinx as an example to
explain these additional architectural features.

Figure 2.1: A conceptual diagram of the island-style FPGA architecture. The key building
blocks of one CLB in the Xilinx UltraScale FPGA are drawn as an example. Note that the
commercial-grade FPGA architecture introduces additional features that are not drawn in
this diagram for simplicity.

Figure 2.2: A conceptual diagram to illustrate the clock distribution network in the prior
FPGAs (left) and the UltraScale FPGA (right).

Clock Region. Instead of having a single clock distribution network for the entire
FPGA, UltraScale FPGAs comprise a 2D array of clock regions and each region has its own
clock distribution network, as illustrated in Figure 2.2. These clock regions are independent
from each other, i.e., (1) the clock distribution network in one clock region can be turned off
without affecting the distribution network in other clock regions, thereby reducing the power
consumption, and (2) different clock signals can be distributed in different clock regions to
further increase the programmability of FPGAs. While the clock region design can reduce
the clock skew as the scale of the clock distribution network is reduced (Figure 2.2), this
clock skew still needs to be considered when developing the virtualization support.
Multi-die Package. UltraScale FPGAs comprise multiple dies in a single package
to increase the capacity. To route the cross-die interconnections, (1) super long lines are
included between every two dies [140], and (2) CLBs at pre-defined locations close to the
boundary of dies are replaced by the Laguna cells [141], as illustrated in Figure 2.3. This
leads to additional heterogeneity in the FPGA architecture, i.e., intra-die routing vs. inter-die
routing and CLBs vs. Laguna cells, which needs to be carefully handled when developing the
virtualization support.

Figure 2.3: A conceptual diagram illustrates the additional architectural features to support
the multi-die package. The routing fabric and hard IP blocks are not drawn for simplicity.

2.2 FPGA Compilation Flow

FPGA compilation is a process that directly maps the data flow of applications onto the
physical hardware. As shown in Figure 2.4a, the FPGA compilation process can be divided
into two stages. The first stage is the front-end that comprises a high-level synthesis tool to
convert applications written in high-level programming languages into Verilog RTL code.
The second stage is the back-end that contains three steps to process the Verilog RTL
code. The first step contains a parser to synthesize the Verilog RTL code into different
levels of intermediate representation, including control data-flow graphs, data-flow graphs
and a netlist of primitives (e.g., logic gates and hard IP blocks). The second step is the
technology mapping that maps logic gates in the netlist into LUTs and flip-flops. The last
step performs physical optimization, including clustering/packing, placement and routing.
This step is NP-hard [135] with a high time complexity and can take up to several
hours or even days, since it needs to place up to millions of primitives onto the physical
hardware and route numerous interconnections between them. As illustrated in Figure 2.4b,
the placement and routing are performed under specific spatial resource constraints, i.e.,
the layout of the reconfigurable resources, and different spatial resource constraints lead
to distinct mapping results. Thus, the compilation results are tightly coupled with the
allocated resource.

Figure 2.4: (a) A typical FPGA compilation flow that comprises a front-end and a back-end.
(b) A conceptual diagram to illustrate the tight coupling between compilation results
and resource allocation. Specifically, different spatial resource constraints of the allocated
FPGA resource lead to distinct compilation results.

2.3 Partial Reconfiguration

Partial reconfiguration [64] is one key technology that enables efficient FPGA sharing
in the spatial domain. With this technology, a sub-region of one FPGA device can be declared
as a partial reconfigurable region. This sub-region can then be reconfigured to run
different applications without affecting the applications running on the remaining part of
the same FPGA device. For Xilinx FPGAs, there are two constraints when creating multiple
partial reconfigurable regions on one FPGA device. (1) These partial reconfigurable
regions cannot overlap with each other. (2) One column of reconfigurable resources
(e.g., CLB, DSP and BRAM) in one clock region cannot be split into multiple partial
reconfigurable regions, as illustrated in Figure 2.5. The second constraint comes from the
organization of the underlying configuration memory in Xilinx FPGAs [143]. Specifically,
the configuration memory is organized into an array of configuration frames, which are the
smallest addressable segments and are one element (e.g., CLB, BRAM and DSP) wide by
one clock region high. The entire content of one configuration frame is modified during
the partial reconfiguration process; thus, one frame cannot be included in multiple partial
reconfigurable regions. While Intel and Xilinx FPGAs have a similar underlying organization of
configuration memory, Intel FPGAs provide an additional two-pass configuration method
(AND/OR mode) to allow the sharing of one column of reconfigurable resources among
multiple partial reconfigurable regions at the cost of increased configuration time and
bitstream size [58]. Thus, Intel FPGAs do not have the second constraint when creating
partial reconfigurable regions.
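
The two Xilinx constraints above can be expressed as a simple check over candidate regions. The sketch below is a minimal illustration under stated assumptions: regions are axis-aligned rectangles given as `(col_lo, col_hi, row_lo, row_hi)` with inclusive bounds measured in elements, a configuration frame is identified by a `(column, clock_region_row)` pair, and the clock region height of 60 elements used in the usage note is only an example value. The function names are hypothetical and do not correspond to any vendor tool.

```python
from itertools import combinations


def frames_of(region, clock_region_height):
    """Return the set of configuration frames (column, clock_region_row) touched."""
    col_lo, col_hi, row_lo, row_hi = region  # inclusive bounds, in elements
    cr_lo = row_lo // clock_region_height
    cr_hi = row_hi // clock_region_height
    return {(c, r)
            for c in range(col_lo, col_hi + 1)
            for r in range(cr_lo, cr_hi + 1)}


def overlaps(a, b):
    """True if two rectangles share at least one element (constraint 1)."""
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def check_regions(regions, clock_region_height):
    """Return (i, j, reason) for every pair of regions that violates a constraint."""
    violations = []
    for (i, a), (j, b) in combinations(enumerate(regions), 2):
        if overlaps(a, b):
            violations.append((i, j, "overlap"))
        elif frames_of(a, clock_region_height) & frames_of(b, clock_region_height):
            # Constraint 2: a column within one clock region is split.
            violations.append((i, j, "shared frame"))
    return violations
```

For example, with a clock region height of 60, `check_regions([(0, 4, 0, 29), (0, 4, 30, 59)], 60)` reports a shared frame because both regions occupy the same columns inside the same clock region, while two regions in disjoint columns pass the check.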

2.4 FPGA Integration

This section broadly discusses the integration methods of FPGAs (not limited to the
cloud environment). We focus on the integration methods that offload compute tasks to
FPGAs, while methods such as using FPGAs as the network switch for software-defined
network [144][146] are not included.
Tightly-attached: FPGAs can be tightly integrated with CPUs either in the same
package [29][56] or on the same board using a low-latency and cache-coherent interconnection (Figure 2.6a), such as Intel QuickPath Interconnect (QPI) [165][57][112]. Nevertheless,
such tight integration is not expected to be widely adopted in the cloud, since it breaks the
homogeneity of computing modules and increases the complexity of design, deployment and
maintenance [130].

Figure 2.5: A conceptual diagram to illustrate the constraints when creating multiple partial
reconfigurable regions in Xilinx FPGAs. Specifically, partial reconfigurable regions #1
and #2 can co-exist on one FPGA device, while regions #1 and #3 cannot be created on
the same FPGA device. The reconfigurable resources are not drawn for simplicity.

PCIe-attached: FPGAs can be implemented on a daughter-card and connected to the
host CPU through the high-speed point-to-point PCIe interconnection (Figure 2.6b). This
is a popular deployment option and has been used for other hardware accelerators such as
GPUs.
Network-attached: FPGAs can also be directly connected to the datacenter network
and communicate with CPU nodes through this network (Figure 2.6c). This reduces the
deployment and management complexity and has been used for deploying other hardware
accelerators such as Google TPU [47].
Existing commercial clouds typically adopt a hybrid method to integrate FPGAs. For
instance, Microsoft [40] and Amazon [4] attach one or multiple FPGAs to the host CPUs
using PCIe and deploy a secondary network for inter-FPGA communication (Figure 2.6d).
This is also the integration method targeted in this dissertation.

Figure 2.6: Conceptual diagrams illustrate the popular integration methods for FPGAs,
which are (a) tightly integrated with CPU in the same package or on the same board, (b)
connected to CPU through PCIe, or (c) directly attached to the datacenter network. (d)
Commercial FPGA clouds [40][4] typically use a hybrid method, which is also the integration
method targeted in this dissertation.

2.5 Cloud Instance Characterization

Commercial cloud platforms allow users to request different cloud instances to account for
the varying demands on cost and performance. On-demand instances and spot instances
(or preemptible instances in Google Cloud) are the two major types of instances in existing
clouds [1][49]. The main difference is that on-demand instances cannot be interrupted
and have a higher priority for scheduling, while spot instances have a lower priority and can
be interrupted by the management system, and are therefore offered at a lower cost. These two
types of instances are also available for hardware accelerators such as GPUs [3] and Google
TPUs [48]. Although only on-demand instances are provided for FPGAs in existing commercial
clouds [2], we expect both types of instances to become available when cloud FPGA resources
are virtualized, following the same trend as in other hardware accelerators. Thus, both
types of instances are considered when designing the runtime scheduling policy (Chapter 5).

Chapter 3
System Abstraction for Cloud FPGAs
This chapter presents the system abstraction developed for the heterogeneous FPGA cluster that (1) serves as an intermediate layer between physical FPGAs and the compilation
framework (Chapter 4) to decouple the compilation and resource allocation, and (2) creates
a homogeneous resource pool for the runtime system to simplify the resource management
(Chapter 5). To better explore the design space, we first present the key design requirements identified from the compilation process, the runtime management and the FPGA
implementation (Section 3.1). Based on these design requirements, Section 3.2 describes a
two-level system abstraction that achieves both high flexibility and efficiency. While being
designed as application-independent to support the PaaS model, this two-level system
abstraction can also be extended to leverage application-specific information and support the
SaaS model. Section 3.4 provides a case study that extends this two-level system abstraction to
support the application-specific ISA, a popular SaaS model for cloud FPGAs. Section 3.5
presents the evaluation results obtained from commercial FPGAs.

3.1 Design Requirements

We first identify two design requirements from the nature of the FPGA compilation process
and our design goals. Based on these two design requirements, the basic structure of the
system abstraction can be determined. With this basic structure, we further identify three
pairs of conflicting design requirements that are related to the tradeoff between flexibility
and efficiency.

Figure 3.1: A conceptual diagram illustrates the basic structure of the system abstraction
and the mapping process.

The basic structure of the system abstraction for cloud FPGAs is drawn in Figure 3.1,
which is determined by two design requirements. The first design requirement comes from
the FPGA compilation process. As discussed in Section 2.2, the existing FPGA compilation process has a high time complexity. Thus, this compilation process needs to be
performed offline to map applications onto the system abstraction in the virtualized environment. Moreover, since the FPGA compilation process (and the compilation process of
most spatial reconfigurable architectures) is performed under specific resource constraints,
the system abstraction needs to expose certain spatial resource constraints to the compilation tools. Thus, we can only abstract FPGA resources into virtual blocks, where each
virtual block comprises a certain amount of FPGA resources that are organized in predefined spatial resource constraints, as illustrated in Figure 3.1. This is consistent with the
design choice made in prior works, such as slot-based methods [21] and SCORE [35]. The
second design requirement comes from our design goal of maximizing the resource utilization. To avoid the dilemma in determining the capacity of one virtual block, we require that
the compilation framework be able to partition applications into multiple virtual
blocks, as illustrated in Figure 3.1. Thus, the capacity of virtual blocks can be reduced
to minimize the resource waste caused by internal fragmentation without increasing the
burden on users. To support this mapping strategy, the system abstraction is required to
contain an interconnection network to connect these virtual blocks (Figure 3.1).
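
The effect of block capacity on internal fragmentation can be illustrated with a small calculation. The sketch below is only an illustration: the function name `internal_waste`, the use of a single scalar resource, and the example numbers are assumptions, and it ignores the resources consumed by the interconnection network between virtual blocks.

```python
def internal_waste(demand, block_capacity):
    """Return (blocks_needed, wasted_resources) for one application.

    demand and block_capacity are in the same (arbitrary) resource unit.
    """
    blocks_needed = -(-demand // block_capacity)  # integer ceiling division
    wasted = blocks_needed * block_capacity - demand
    return blocks_needed, wasted


# An application needing 130 units, mapped onto coarse and fine virtual blocks.
for capacity in (100, 50, 25):
    print(capacity, internal_waste(130, capacity))
```

In this example, blocks of 100 units leave 70 units unused, whereas blocks of 25 units leave only 20 unused, at the price of spreading the application over more blocks that must then communicate through the interconnection network.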
With the basic structure drawn in Figure 3.1, we further identify three pairs of conflicting
design requirements that are related to the tradeoff between flexibility and efficiency:
(1) Homogeneous or Heterogeneous System Abstraction
A homogeneous system abstraction with identical virtual blocks can simplify the runtime
management and provides portability across different FPGA clusters (flexibility), while a
heterogeneous system abstraction with different types of virtual blocks that are specialized
to each type of FPGA can improve the resource utilization (efficiency), as illustrated in Figure 3.2. Specifically, different spatial resource constraints lead to distinct mapping results,
as illustrated in Figure 2.4b. While it is possible to configure one FPGA using the compilation result generated from different spatial resource constraints, which has been explored
in prior works [11][89][59] to enable code portability across different types of FPGAs, this
strategy could incur a high resource waste due to the mismatch of the spatial resource constraints (requiring 40× to 100× more FPGA resources), as reported in prior works [11][89].
Thus, it is preferred to provide multiple types of virtual blocks in the system abstraction,
one for each type of FPGA, to ensure the mapping quality. However, such a heterogeneous
system abstraction is coupled with the composition of the FPGA cluster, i.e., the system
abstraction needs to be modified when a new type of FPGA is added into the FPGA cluster.
Thus, it is hard to apply such a heterogeneous abstraction onto different FPGA clusters.
On the contrary, a homogeneous system abstraction can support different FPGA clusters
at the cost of low resource utilization.

Figure 3.2: A conceptual diagram illustrates the conflicting requirements on the system
abstraction. Specifically, a homogeneous system abstraction (top) provides portability across
different heterogeneous FPGA clusters, but has non-negligible resource waste due to the
mismatched spatial resource constraints. On the contrary, a heterogeneous abstraction with
specialized virtual blocks can achieve a high resource utilization at the cost of no portability.

(2) Asynchronous or Synchronous Interfaces
An asynchronous interface (or a latency-insensitive interface) for the communication
between virtual blocks enables a dynamic runtime deployment (flexibility), while a
synchronous interface improves the mapping quality (efficiency), as illustrated in Figure 3.3.
Specifically, as the asynchronous interface can hide the latency difference between on-chip
and off-chip interconnection networks, virtual blocks can be deployed onto either the same
FPGA device or different FPGA devices at runtime without incurring timing errors. Moreover,
by hiding the low-level latency, the asynchronous interface can also support different
inter-FPGA networks, regardless of whether they have a deterministic latency. Nevertheless,
since the on-chip interconnection network has a deterministic latency that can be resolved
at offline compile time, a synchronous interface is sufficient for the on-chip communication
and requires fewer resources to implement than an asynchronous interface.
Moreover, a synchronous interface also exposes more low-level hardware details (i.e., the
maximum bandwidth provided by the interconnection network) to the compilation framework
to improve the mapping quality (Section 4.1).

Figure 3.3: A conceptual diagram to illustrate the difference between an asynchronous
interface (top) and a synchronous interface (bottom). Specifically, the asynchronous interface
enables a dynamic runtime deployment at the cost of additional buffers and control logic.
On the contrary, the synchronous interface can be efficiently implemented but only supports
a static deployment that is determined at offline compile time.

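
To illustrate why a latency-insensitive interface tolerates different deployments, the following sketch simulates a link that uses credit-based flow control with a receive buffer. This is one common way to build such an interface and is an assumption of the sketch, not a description of the interface implemented in this dissertation; the function name `run_channel` and its parameters are hypothetical.

```python
from collections import deque


def run_channel(tokens, link_latency, buffer_depth=4):
    """Send tokens over a link with the given latency (in cycles).

    Returns (received_tokens, cycles_taken). The sender may only inject a
    token while it holds a credit, so the receive buffer never overflows.
    """
    to_send = deque(tokens)
    in_flight = deque()       # (arrival_cycle, token) travelling to the receiver
    credit_return = deque()   # arrival cycles of credits travelling back
    rx_buffer = deque()       # receiver-side buffer
    credits = buffer_depth
    received = []
    cycle = 0
    while len(received) < len(tokens):
        # Credits that have travelled back become usable again.
        while credit_return and credit_return[0] <= cycle:
            credit_return.popleft()
            credits += 1
        # Tokens that have crossed the link enter the receive buffer.
        while in_flight and in_flight[0][0] <= cycle:
            rx_buffer.append(in_flight.popleft()[1])
        # The consumer drains one token per cycle and returns a credit.
        if rx_buffer:
            received.append(rx_buffer.popleft())
            credit_return.append(cycle + link_latency)
        # The sender injects one token per cycle while it holds a credit.
        if to_send and credits > 0:
            credits -= 1
            in_flight.append((cycle + link_latency, to_send.popleft()))
        cycle += 1
    return received, cycle
```

Running the same token stream with `link_latency=1` (an on-chip-like link) and `link_latency=9` (an off-chip-like link) yields the same received sequence; only the cycle count differs. The buffers and control logic that make this possible are the extra cost that a synchronous interface avoids.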
(3) All-to-all Network or Direct Interconnections
Using an all-to-all interconnection network increases the flexibility of the partition process when mapping applications into multiple virtual blocks (flexibility), while providing
direct interconnections between certain pairs of virtual blocks increases the resource utilization by reducing the amount of system-reserved resources (efficiency). Specifically, an
all-to-all interconnection network can support applications with a large Rent’s exponent [76]
and provide better support for large-scale applications. On the contrary, only providing direct interconnections between certain pairs of virtual blocks largely reduces the amount
of resources reserved for implementing the interconnection network, thereby increasing the
amount of resources available to users. It is also easier to isolate the inter-block communication of different applications (in terms of both performance and security) when using the
direct interconnection compared to using an all-to-all network.
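
To make the resource argument concrete, the following sketch counts the inter-block links needed under two simplifying assumptions that are not taken from this dissertation: the all-to-all network is realized with one dedicated link per pair of virtual blocks, and the direct interconnections form a 1D chain between adjacent blocks. Real all-to-all networks are often built from shared switches, so the numbers only indicate the trend.

```python
def all_to_all_links(num_blocks):
    """Links needed if every pair of blocks gets its own dedicated link."""
    return num_blocks * (num_blocks - 1) // 2


def chain_links(num_blocks):
    """Links needed if only adjacent blocks in a 1D array are connected."""
    return max(num_blocks - 1, 0)


for n in (2, 4, 8, 16):
    print(n, all_to_all_links(n), chain_links(n))
```

The pairwise count grows quadratically with the number of blocks while the chain grows linearly, which is consistent with the observation that direct interconnections reserve far fewer resources.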

3.2 Two-Level System Abstraction

This dissertation provides a new two-level system abstraction to decouple the aforementioned conflicting design requirements. Overall, the high-level abstraction provides a homogeneous view of the FPGA cluster to hide the heterogeneity across FPGAs, thereby
simplifying the runtime resource management and enabling portability. In contrast, the
low-level abstraction exposes the heterogeneous spatial resource constraints to the compilation framework to ensure the compilation quality. Moreover, the high-level abstraction
provides an all-to-all network with an asynchronous interface to support large-scale applications and a flexible runtime deployment, while the low-level abstraction organizes virtual
blocks into a 1D array and adopts direct interconnections with a synchronous interface
between adjacent virtual blocks to maximize the utilization of the on-chip interconnection
network and minimize the amount of system-reserved resources.
The high-level abstraction is designed to be FPGA-agnostic to hide as many hardware
details as possible. As depicted in Figure 3.4a, the high-level abstraction comprises a pool
of high-level virtual blocks (HL virtual blocks) that are connected by an all-to-all network
through the asynchronous (latency-insensitive) interface. This interconnection design not
only enables a flexible runtime deployment and supports various inter-FPGA networks, but
also provides an efficient support for compiling large-scale applications that have a high
Rent’s exponent. To abstract away the heterogeneity across FPGAs, the capacity and
the spatial resource constraints of the HL virtual blocks can be arbitrarily chosen by the
compilation framework (Section 4.1.1). Consequently, one HL virtual block can be migrated
across different types of FPGAs at runtime. Each HL virtual block also contains interfaces
for peripherals to provide the necessary virtualization support.
The low-level abstraction is designed to be FPGA-specific and expose as many hardware
details as possible to the compilation framework. As illustrated in Figure 3.4b, it uses an
array of identical low-level virtual blocks (LL virtual blocks) to virtualize the resources of one
FPGA and comprises multiple arrays to support different types of FPGAs. The number of
LL virtual blocks in one array is arbitrarily chosen by the compilation framework, while the
number of LL virtual block arrays is equal to the number of FPGA types in the cluster. The
capacity and spatial resource constraints of one LL virtual block are tailored to a specific type
of FPGA, as explained in Section 3.2.1. To expose more details on the interconnection
network, the LL virtual blocks provide two inter-block communication interfaces (Figure 3.4b).
A latency-insensitive interface is provided to implement the latency-insensitive interface in
the high-level abstraction and a synchronous interface with a deterministic latency and a
pre-defined maximum bandwidth is provided for the communication between adjacent LL
virtual blocks in one array. By applying additional constraints on the virtual-to-physical
mapping (Section 3.2.2), the latency-insensitive interface in both high-level and low-level
abstraction is only used for the inter-FPGA communication, while the synchronous interface
in the low-level abstraction is only used for the intra-FPGA communication. This allows
the compilation framework to apply appropriate optimization goals for different interconnections to improve the mapping quality, i.e., minimizing the required bandwidth for the
inter-FPGA communication to reduce the burden on the off-chip interconnection network,
while maximizing the utilization of the on-chip interconnection network (i.e., FPGA routing
fabric). The virtual-to-physical mapping strategy also requires that one array of LL virtual
blocks is deployed into one FPGA device. This limits the scale of the LL virtual block
array, so using simple direct interconnections does not lead to a scalability concern.

Figure 3.4: A conceptual diagram illustrates the two-level system abstraction for a
heterogeneous FPGA cluster. (a) The high-level abstraction comprises a pool of high-level
virtual blocks (HL virtual blocks) that are connected by an all-to-all network. An
asynchronous interface is provided for the inter-block communication. One HL virtual block has
no spatial resource constraint to hide the heterogeneity across FPGAs. (b) The low-level
abstraction comprises multiple arrays of low-level virtual blocks (LL virtual blocks), where
one array abstracts one type of FPGA. One LL virtual block contains a certain amount of
reconfigurable resources that are organized in pre-defined spatial resource constraints. A
synchronous interface is provided for the intra-array communication, while an asynchronous
interface is also provided to implement the asynchronous interface in the high-level abstraction.

3.2.1 FPGA Overlay

An FPGA overlay is created to support the proposed system abstraction. In this section,
we use Xilinx FPGAs as an example to illustrate this overlay, which can also be created for
Intel FPGAs using the same strategy. Specifically, one physical FPGA device is partitioned
into two regions to support the proposed abstraction, i.e., the Service Region and User
Region, as illustrated in Figure 3.5. The Service Region is reserved by the system
and is not exposed to users. It contains dedicated modules to realize the virtualization
support for the peripheral devices attached to the physical FPGAs, such as the on-board
DRAM (Figure 3.5). The User Region is further divided into a group of physical blocks.
The LL virtual blocks are deployed into these physical blocks at runtime. These physical
blocks are created to be identical, so that one LL virtual block can be relocated into an
arbitrary physical block at runtime without recompilation to minimize the compilation
cost. As illustrated in Figure 2.1, the existing FPGAs have a column-based architecture
comprising multiple columns where each column contains the same type of resources. Thus,
we can partition the User Region in the row direction to preserve the periodicity in the
architecture and create identical physical blocks (Figure 3.5). As discussed in Section 2.3,
we cannot create two physical blocks in one clock region due to the constraint from the
organization of the underlying configuration memories. Thus, the height of the physical
blocks should be equal to that of one or multiple clock regions to avoid resource waste.
This requirement also ensures that the clock skew is not changed when relocating LL virtual
blocks across physical blocks, since the routing of clock signals is not changed. Then the
height of the physical block is set to the minimal value, i.e., the height of one clock region, to
minimize the capacity of one physical block, thereby reducing the resource waste caused by
the internal fragmentation. The additional heterogeneity caused by the multi-die package
is handled by the virtual-to-physical mapping, as explained in Section 3.2.2.

Figure 3.5: The physical FPGA is divided into Service Region and User Region to support
the two-level system abstraction. The virtualization support for on-board DRAM is drawn
as an example. Note that the actual layout of these regions is tailored to a specific type of
FPGA.

Figure 3.6: A conceptual diagram illustrates the virtual-to-physical mapping, where one
HL virtual block is offline mapped into an array of LL virtual blocks and then deployed
into one FPGA of the corresponding type at runtime. Multiple mapping results are offline
generated for one HL virtual block so that it can be deployed into different types of FPGAs
at runtime.

3.2.2 Virtual-to-Physical Mapping

The virtual-to-physical mapping strategy is illustrated in Figure 3.6. Specifically, one HL
virtual block is offline mapped into an array of LL virtual blocks and then deployed into
one physical FPGA at runtime. To support a flexible runtime deployment, one HL virtual
block is offline mapped onto all feasible LL virtual block arrays. One LL virtual block array
is feasible if it provides all the resources required by the HL virtual block. Consequently,
one HL virtual block can have multiple mapping results, and the runtime system selects the
appropriate mapping result to deploy one HL virtual block into the corresponding type of
FPGA at runtime (Figure 3.6). In principle, multiple HL virtual blocks of one application
could be deployed onto the same FPGA device. In that case, however, the asynchronous
latency-insensitive interface in the high-level abstraction would be used for both on-chip and
off-chip communication, leading to an inefficient utilization of the on-chip interconnection
network. Thus, we require that HL virtual blocks of one application are not deployed into the
same FPGA device. Consequently, the asynchronous interface in both the high-level and
low-level abstractions is only used for inter-FPGA communication, while the synchronous
interface in the low-level abstraction is only used for intra-FPGA communication. This allows the compilation
framework to apply different optimization goals for these two types of interconnections,
as described in Section 4.1. Note that HL virtual blocks of different applications can be
deployed onto the same FPGA device to improve the resource utilization.
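
The mapping rules in this paragraph can be summarized with a small sketch. It is a simplified model under stated assumptions: the class and function names (`LLArray`, `Fpga`, `feasible_arrays`, `choose_fpga`) are hypothetical, resources are plain per-type totals, and the runtime step ignores free capacity and simply takes the first eligible FPGA. It is not the scheduling policy of this dissertation, which is presented in Chapter 5.

```python
from dataclasses import dataclass, field


@dataclass
class LLArray:
    """One LL virtual block array; `resources` is the total it offers."""
    fpga_type: str
    resources: dict


@dataclass
class Fpga:
    name: str
    fpga_type: str
    apps: set = field(default_factory=set)  # applications already hosted


def feasible_arrays(required, arrays):
    """Offline step: keep arrays that provide every resource the HL block needs."""
    return [a for a in arrays
            if all(a.resources.get(kind, 0) >= amount
                   for kind, amount in required.items())]


def choose_fpga(app, feasible_types, fpgas):
    """Runtime step: pick an FPGA whose type has a mapping result and that does
    not already host another HL virtual block of the same application."""
    for fpga in fpgas:
        if fpga.fpga_type in feasible_types and app not in fpga.apps:
            fpga.apps.add(app)
            return fpga.name
    return None
```

In this model, two HL virtual blocks of the same application always land on different FPGAs, while HL virtual blocks of different applications may share one device, matching the rule stated above.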
One and only one LL virtual block (containing both user logic and the latency-insensitive
interface) can be deployed into one physical block at runtime. In principle, we could deploy
multiple LL virtual blocks into one physical block to maximize the utilization of this physical block. This strategy is not adopted for two main reasons. First, it requires
an additional checking process at runtime to avoid potential resource conflicts. On the
one hand, this additional checking process inevitably increases the runtime management
complexity. On the other hand, it is also hard to perform such a checking process as user
applications are encrypted in most FPGA clouds (e.g., AWS F1 [5]). Second, deploying
multiple LL virtual blocks into one physical block also leads to security concerns, such as
side-channel attacks [107][161], as applications from different users are not physically isolated. Thus, we choose to deploy at most one LL virtual block into one physical block to
simplify the runtime management and provide strong isolation. An array of LL virtual
blocks is deployed into an array of physical blocks, where adjacent LL virtual blocks are
deployed into adjacent physical blocks, as illustrated in Figure 3.7. While this requirement
slightly reduces the runtime deployment flexibility, it brings two advantages. First, it
guarantees that the inter-block timing remains the same under different runtime deployments,
especially when a synchronous interface is adopted for inter-block communication. Second, it simplifies the implementation of the interconnection network while still isolating the
inter-block communication of different applications.
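The adjacency requirement can be illustrated with a small first-fit sketch. It assumes, for illustration only, that the physical blocks of one FPGA form a single column indexed from top to bottom and that die boundaries can be ignored; the function names and the first-fit policy are hypothetical and are not taken from the runtime system.

```python
def find_adjacent_slot(occupied, array_len):
    """Return the first index where array_len adjacent physical blocks are free, else None."""
    run_start, run_len = None, 0
    for i, busy in enumerate(occupied):
        if busy:
            run_start, run_len = None, 0  # a busy block breaks the current free run
            continue
        if run_start is None:
            run_start = i
        run_len += 1
        if run_len == array_len:
            return run_start
    return None

def deploy(occupied, array_len):
    """Deploy an LL virtual block array: one LL block per physical block, all adjacent."""
    start = find_adjacent_slot(occupied, array_len)
    if start is None:
        return None
    for i in range(start, start + array_len):
        occupied[i] = True
    return list(range(start, start + array_len))

# Eight physical blocks; True marks a block that already hosts an LL virtual block.
occupied = [True, False, False, True, False, False, False, False]
placement_a = deploy(occupied, 3)  # needs three adjacent free blocks
placement_b = deploy(occupied, 2)
placement_c = deploy(occupied, 2)  # only one isolated free block is left
print(placement_a, placement_b, placement_c)
```

Because each array occupies consecutive physical blocks, the distance between neighboring LL virtual blocks is the same for every placement, which is the property the constrained mapping strategy relies on.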
In order to handle the additional heterogeneity caused by the multi-die package, the
low-level abstraction comprises multiple LL virtual block arrays for one type of FPGA to
account for the difference between intra-die and inter-die communication, as illustrated in
Figure 3.8a. One HL virtual block is mapped onto all these LL virtual block arrays to
support different runtime deployments. The number of required LL virtual block arrays is
equal to the number of physical blocks in one die. As vendors typically adopt small dies to
improve yield, the number of required LL virtual block arrays and the added compilation
cost are limited, e.g., 4 arrays for the XCVU37P FPGA. Moreover, one LL virtual block array can
be reused across a set of FPGAs to effectively amortize the compilation cost (Figure 3.8b).
This is because vendors reuse a large portion of one die design across a set of FPGAs¹ to
minimize the design cost [139]. The major difference is the number of dies and the provided
I/O components (e.g., the high-speed transceivers), which does not change the low-level
abstraction.

3.2.3

Design Space Exploration

The aforementioned low-level abstraction only provides one type of LL virtual block for
one type of physical FPGA (if ignoring the heterogeneity caused by the multi-die package).
¹ For instance, VU31P, VU33P, VU35P, VU37P, VU45P, VU47P and VU57P from Xilinx have a similar die design.

[Figure 3.7 diagram. Left panel: Unconstrained Mapping Strategy, with a complex inter-block network and varying inter-block timing. Right panel: Constrained Mapping Strategy, with a simple inter-block network and unchanged inter-block timing. In both panels, the low-level virtual blocks of Application #1 and Application #2 are deployed into physical blocks.]

Figure 3.7: A conceptual diagram illustrates the benefits of the constrained mapping strategy (right). Specifically, this strategy ensures that the inter-block timing does not change
under the dynamic runtime deployment. Moreover, this strategy also simplifies the interconnection network between physical blocks. In contrast, the unconstrained mapping
strategy (left) has a varying inter-block timing under different runtime deployments. Moreover, a dedicated interconnection needs to be provided for each pair of physical blocks to
ensure that the inter-block connections are not shared between applications, leading to a
complex interconnection network.

[Figure 3.8 diagram. (a) Multiple LL virtual block arrays for the same type of FPGA: offline, one high-level virtual block is mapped onto several LL virtual block arrays; at runtime, one of them is deployed into the physical blocks of a multi-die physical FPGA. (b) One LL virtual block array is deployed at runtime into FPGA Type 1 or FPGA Type 2. Legend: latency-insensitive interface, interface to peripherals, synchronous interface for intra-die communication, synchronous interface for inter-die communication.]

Figure 3.8: (a) Multiple LL virtual block arrays with different combinations of the synchronous interfaces are provided for one type of FPGA to account for the difference between
the inter-die and intra-die communication latency. (b) One LL virtual block array could be
shared among a set of FPGAs if these FPGAs reuse the same die design. This effectively
amortizes the compilation cost. The service region in physical FPGAs is not drawn for
simplicity.

[Figure 3.9 diagram: two FPGA layouts, each with a Service Region and a User Region divided into physical blocks, labeled "Two types of LL virtual blocks" and "Three types of LL virtual blocks".]

Figure 3.9: A conceptual diagram illustrates that smaller physical blocks can be created if
multiple types of LL virtual blocks are provided for one type of physical FPGA (ignoring
the heterogeneity caused by the multi-die package).
Then, the physical block needs to contain an entire row of FPGA resources, as described in
Section 3.2.1. The size of the physical blocks can be reduced by providing multiple types of
LL virtual blocks for one type of FPGA. This allows us to reduce the width of the physical
blocks, as illustrated in Figure 3.9. While reducing the resource waste caused by the internal
fragmentation, this increases the compilation overhead as one HL virtual block needs to be
mapped into all feasible LL virtual block arrays. To explore this tradeoff between resource
utilization and compilation overhead, we create a new low-level abstraction that provides
two types of LL virtual blocks for each type of FPGA (if ignoring the heterogeneity caused
by the multi-die package). Section 3.5.3 shows the FPGA implementation results, Section
4.4.1 presents the compilation cost of using such low-level abstraction, and Section 5.4.3
discusses the runtime performance.

3.3

Specialized to a Homogeneous Cluster

While the two-level system abstraction can be directly applied to a homogeneous FPGA
cluster by only providing one type of LL virtual blocks in the low-level abstraction (if ignoring the heterogeneity caused by the multi-die package), we further propose a specialized
abstraction for the homogeneous FPGA cluster with reduced compilation cost. Specifically,
we can merge the two-level system abstraction into a single-level one. As illustrated in Figure 3.10a,
the LL virtual block with the pre-defined capacity and spatial resource constraints
is used in this abstraction. These virtual blocks are organized in a 1D array to simplify
the required interconnection network, while a latency-insensitive interface is applied for the
inter-block communication. By using the asynchronous interface, one compilation result
can support all possible runtime deployments (Figure 3.10b), thereby minimizing the compilation overhead. Nevertheless, this single-level system abstraction has two limitations.
(1) It is hard to support large-scale applications that have a large Rent’s exponent, and
(2) the asynchronous interface leads to an inefficient utilization of the on-chip interconnections, which reduces the mapping quality. Thus, it is preferred to apply this single-level
system abstraction for small-scale applications that do not have restrictive requirements
on the mapping quality. We note that, thanks to the programmability of FPGAs, the
two-level and single-level system abstractions can co-exist in a homogeneous FPGA cluster
(Figure 3.10c) to balance the tradeoff between compilation quality and compilation cost.
While using the latency-insensitive interface for inter-block communication reduces the
compilation cost, it also reduces the amount of resources exposed to users, since the implementation of a latency-insensitive interface requires more FPGA resources than that of
a synchronous interface due to the additional data buffers and control logic. To reduce
the amount of resources reserved by the system, we eliminate the buffers for the on-chip
inter-block interconnection by leveraging the fact that the on-chip communication has a
deterministic latency that can be resolved at the offline compile time. Then the compilation framework can generate the control logic that calculates the arrival time of input data
based on the specific communication latency and resumes the execution of user logic to
consume the input data when it arrives (Section 4.2). The implementation of the latency-insensitive interface for the inter-FPGA communication still contains data buffers since the
inter-FPGA communication latency is non-deterministic.
[Figure 3.10 diagram. (a) A 1D array of identical virtual blocks containing CLB, BRAM and DSP resources, with a latency-insensitive interface, an interface to peripherals, and the interconnection between blocks. (b) One mapping result supports various runtime deployments of Application #1, #2 and #3. (c) A homogeneous FPGA cluster with parts managed by the two-level system abstraction and parts managed by the single-level system abstraction.]

Figure 3.10: (a) A conceptual diagram illustrates the single-level system abstraction specialized for the homogeneous FPGA cluster, which comprises a 1D array of identical virtual
blocks. (b) This single-level system abstraction minimizes the compilation cost as one compilation result can be used for different runtime deployments. Only physical blocks are
drawn for simplicity. (c) The single-level system abstraction and two-level system abstraction effectively complement each other. Although they require different FPGA overlays,
they can co-exist in a homogeneous FPGA cluster enabled by the FPGA’s programmability.

With the above optimization technique, the latency-insensitive interface of one virtual
block has different implementations for intra-FPGA and inter-FPGA communication. If
the latency-insensitive interface is still combined with user logic and mapped into physical
blocks (the mapping strategy used in two-level system abstraction), then four mapping
results need to be generated for each virtual block, since each virtual block has two latency-insensitive interfaces
(Figure 3.10a) and each interface has two possible implementations.
This increases the compilation cost by roughly 4×. To minimize the compilation cost, we
create an additional Communication Region to map the latency-insensitive interface, as
illustrated in Figure 3.11a. This region needs to be defined as a partially reconfigurable region, as the implementation of the latency-insensitive interface varies for different virtual
blocks. Moreover, this region is preferably placed between physical blocks to minimize the interconnection delay and maximize the number of supported interconnections
(Figure 3.11a). Nevertheless, this substantially reduces the number of physical blocks provided by one FPGA device due to the constraints in creating partially reconfigurable regions
(Section 2.3), as illustrated in Figure 3.11b. To address this issue, we allow only
the physical blocks on the top and bottom to access the inter-FPGA network, while the
physical blocks in the middle can only have intra-FPGA communication, as illustrated in
Figure 3.11c. The communication region for implementing the intra-FPGA communication interface only needs to provide DFFs for timing isolation, and we can create such a region
for the worst case to support all virtual blocks. Then these communication regions can
be created as small static regions to minimize the resource waste, as illustrated in Figure 3.11c. The control logic for the intra-FPGA communication interface is merged into
the inter-FPGA communication interface, as described in Section 4.2.
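The factor of four mentioned above follows from a simple enumeration: two latency-insensitive interfaces per virtual block, each with two possible implementations. The sketch below only illustrates this counting argument; the interface and implementation names are hypothetical labels.

```python
from itertools import product

# Hypothetical labels: each virtual block in the 1D array has two interfaces,
# and each interface is implemented for intra-FPGA or inter-FPGA communication.
INTERFACES = ("interface_0", "interface_1")
IMPLEMENTATIONS = ("intra_fpga", "inter_fpga")

# Every combination would need its own mapping result if the interface
# were compiled together with the user logic.
combined = list(product(IMPLEMENTATIONS, repeat=len(INTERFACES)))
for variant in combined:
    print(dict(zip(INTERFACES, variant)))
print(len(combined), "mapping results per virtual block")
```

Mapping the interface into a separate communication region removes this multiplication, since the user logic no longer depends on which interface implementation is selected at runtime.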

3.4

Case Study: Extend to Support Application-Specific ISA

While designed to be application-independent, this generic two-level system abstraction can be easily extended to leverage application-specific information and support the SaaS
model. In this section, we use the application-specific ISA (AS ISA) [40] as a case study to
show this extensibility.
As illustrated in Figure 3.12a, an additional abstraction layer is added on top of the two-level system abstraction. This additional abstraction layer comprises a pool of soft blocks

[Figure 3.11 diagram. (a) A communication region (buffers and control logic) placed between the user logic of two physical blocks, each one clock region high, with the inter-block interconnection. (b), (c) FPGA layouts showing physical blocks, communication regions for intra-FPGA and inter-FPGA communication, the service region, and the transceiver connecting to other FPGAs.]

Figure 3.11: A Communication Region is included to implement the latency-insensitive
interface. (a) As the width of the physical block is larger than its height, placing the communication region between two physical blocks reduces the interconnection length and supports
more interconnections compared with placing the communication region on the left/right side
of physical blocks. (b) The communication region needs to be created as a partially reconfigurable region to support various latency-insensitive interfaces. This substantially reduces
the number of physical blocks provided by one FPGA due to the constraints in creating partially reconfigurable regions. (c) Thus, we only create communication regions that support
inter-FPGA communication for the physical blocks on the top and bottom. The communication regions in the middle only support intra-FPGA communication and are created as
static regions. Note that the actual layout of these regions is tailored to a specific type of
FPGA.

[Figure 3.12 diagram. (a) The application-specific top-level abstraction (a pool of soft blocks with a latency-insensitive interface) stacked on the application-independent high-level and low-level abstractions. (b) A parent soft block whose child blocks are connected by data parallelism or pipeline parallelism. (c) A three-level adder tree built from these two patterns, where each leaf is an RTL module that implements an adder.]

Figure 3.12: (a) An application-specific abstraction layer can be added on top of the two-level system abstraction to support application-specific ISA. This additional layer comprises
a pool of soft blocks. Like the high-level virtual blocks, these soft blocks also have
variable spatial resource constraints to simplify the partition process. (b) The soft block
has a multi-level tree structure, where one soft block can have an arbitrary number of child
blocks that are connected by either data parallelism or pipeline parallelism. (c) These
two primitive parallel patterns are sufficient to construct other complex patterns, such as
the adder tree.

and each soft block provides an asynchronous interface for the inter-block communication.
In contrast to the high-level and low-level abstractions, which adopt a single-level structure,
this abstraction layer adopts a multi-level tree structure to represent the application-specific
parallel patterns extracted from AS ISA-based accelerators. Specifically, a leaf soft block
contains a basic module, where the basic module is defined as a Verilog module that does not
instantiate other Verilog modules. A non-leaf soft block can have an arbitrary number of soft
blocks as its child blocks, and these child blocks are connected by one of the two primitive
parallel patterns, i.e., the data parallelism and pipeline parallelism (Figure 3.12b). These
two parallel patterns are selected because they are sufficient to construct other complex
parallel patterns [105], e.g., the reduction pattern drawn in Figure 3.12c. Similar to the
HL virtual block, the capacity and the spatial resource constraints of the soft blocks are
arbitrarily chosen to simplify the compilation process. The extracted parallel patterns are
then leveraged to simplify the process of partitioning applications into HL virtual blocks, as
illustrated in Figure 3.13. A more detailed partition process is described in Section 4.3.2.
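The tree structure of soft blocks can be sketched with a small recursive data type. This is only an illustration under stated assumptions: the `SoftBlock` and `Pattern` names, the `module` field, and the way the three-level adder tree is decomposed (pipeline stages, each a data-parallel group of adders) are hypothetical choices, not the representation used by the compilation framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Optional

class Pattern(Enum):
    DATA = "data parallelism"
    PIPELINE = "pipeline parallelism"

@dataclass
class SoftBlock:
    name: str
    module: Optional[str] = None       # basic Verilog module held by a leaf block
    pattern: Optional[Pattern] = None  # how the children of a non-leaf block are connected
    children: List["SoftBlock"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> Iterator["SoftBlock"]:
        """Yield every leaf soft block below (or equal to) this block."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

    def depth(self) -> int:
        """Number of levels in the tree rooted at this block."""
        return 1 + max((c.depth() for c in self.children), default=0)

def adder(name: str) -> SoftBlock:
    return SoftBlock(name, module="adder")  # "adder" is a placeholder module name

def stage(name: str, width: int) -> SoftBlock:
    # One pipeline stage: `width` adders working side by side (data parallelism).
    return SoftBlock(name, pattern=Pattern.DATA,
                     children=[adder(f"{name}_add{i}") for i in range(width)])

# Three-level adder tree: stages of 4, 2 and 1 adders connected as a pipeline.
adder_tree = SoftBlock("adder_tree", pattern=Pattern.PIPELINE,
                       children=[stage("stage0", 4), stage("stage1", 2), stage("stage2", 1)])
print(len(list(adder_tree.leaves())), adder_tree.depth())
```

The example shows how a reduction pattern can be expressed with only the two primitive patterns: data parallelism inside each stage and pipeline parallelism between stages.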
[Figure 3.13 diagram: soft blocks connected by pipeline parallelism and by data parallelism, annotated with high and low bandwidth requirements, are grouped into high-level virtual blocks.]

Figure 3.13: A conceptual diagram to illustrate that the extracted parallel patterns are
leveraged to simplify the mapping from the additional abstraction layer to the high-level
abstraction layer.

3.5

Results

The proposed system abstraction is implemented on a custom-built FPGA cluster that
has three Xilinx Virtex UltraScale+ FPGAs (XCVU37P) and one Xilinx UltraScale FPGA
(XCKU115). These four FPGAs are attached to the host machine through PCIe, and a secondary bidirectional ring network is deployed to connect these FPGAs. Specifically, Xilinx
XCVU37P is a large, recent FPGA device fabricated in the 14/16nm technology node.
One FPGA board provides four 1 × 4 ganged 28Gb/s QSFP+ cages for 100Gb Ethernet
connection. Two DIMM sites are provided and each supports up to 128GB DDR4×72 with
ECC. XCKU115 is a relatively small and old FPGA device fabricated in the 20nm technology node. One FPGA board provides two QSFP28 cages for 40Gb/s Ethernet connection.
It also provides 12GB DDR4 memory with ECC and 4GB DDR4 memory without ECC.
Vivado 2020.1 is used to generate the mapping results of the FPGA overlay (Section 3.2.1). Most results are directly obtained from Vivado, such as the capacity of one
physical block. A small benchmark is created to measure the inter-FPGA communication
bandwidth and delay. This benchmark comprises two building blocks that are mapped onto
two FPGAs. The building block A generates random data tokens that are sent to the building block B through the inter-FPGA network. The building block B then sends the data
token back to block A so that block A can measure the communication latency/bandwidth.
This strategy has also been used in prior works for measuring the performance of the inter-FPGA interconnection [119].
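How block A could turn its observations into latency and bandwidth figures can be sketched as follows. This is a software illustration only: it assumes a symmetric link and a negligible turnaround time in block B, so that the one-way latency is half of the round-trip time, and it estimates throughput from the spacing of echo arrivals. The function name, the record format, and the timestamps are hypothetical.

```python
def summarize_echo_test(records, token_bits):
    """records: (send_time_ns, echo_arrival_ns) pairs observed by block A; needs >= 2 tokens."""
    round_trips = [arrival - sent for sent, arrival in records]
    # Assumption: symmetric link, negligible turnaround in block B.
    one_way_latency_ns = sum(round_trips) / len(round_trips) / 2
    arrivals = sorted(arrival for _, arrival in records)
    # Steady-state rate: tokens received after the first one, over the time they took.
    throughput_gbps = (len(arrivals) - 1) * token_bits / (arrivals[-1] - arrivals[0])
    return one_way_latency_ns, throughput_gbps  # bits per ns equals Gb/s

# Made-up trace: four 512-bit tokens sent every 10 ns, each echoed back after 100 ns.
records = [(0, 100), (10, 110), (20, 120), (30, 130)]
latency_ns, throughput_gbps = summarize_echo_test(records, token_bits=512)
print(latency_ns, throughput_gbps)
```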

3.5.1

Two-Level System Abstraction

The FPGA overlay implemented on the XCVU37P FPGA is shown in Figure 3.14, which
contains two regions as discussed in Section 3.2.1. The sub-regions indexed with S belong to
the Service Region, while the sub-regions indexed with U belong to the User Region.
Specifically, the user region is further partitioned into 10 physical blocks (U-0 to U-9).
The capacity of each physical block is reported in Table 3.1. Note that a small portion
of reconfigurable resources in the middle of the FPGA device is not included in the physical

[Figure 3.14 diagram: layout of the XCVU37P overlay with die boundaries marked. User region: physical blocks 0-9 (U-0 to U-9). Service region: S-0 high-speed transceiver; S-1 multiplexers that share the inter-FPGA connection between physical blocks; S-2 DDR4 controller; S-3 PCIe controller; S-4 multiplexers that share the DRAM/PCIe interface between physical blocks. Annotations mark regions that are one and two clock regions high.]

Figure 3.14: The commercial FPGA XCVU37P from Xilinx is partitioned into regions to
support the two-level system abstraction. User Region that is indexed with U is exposed
to users, while the Service Region that is indexed with S is reserved by the system.
The circuits in the system-reserved regions are pre-implemented and cannot be modified by
users. The mapping results are obtained from Vivado 2020.1.
Table 3.1: Resources provided by one physical block and the maximum communication
bandwidth provided by the intra-die and inter-die interconnections.

Abstraction     FPGA    LUTs     DFFs      BRAM     DSPs   Intra-die   Inter-die
Two-Level       VU37P   86.88K   173.76K   4.64Mb   696    Native∓     Native∓
                KU115   58.08K   116.16K   6.75Mb   528    Native∓     Native∓
Single-Level    VU37P   83.52K   127.04K   4.64Mb   672    40K × f     13K × f
                KU115   51.36K   72.72K    5.91Mb   528    30K × f     8K × f
Two-Level†      VU37P   43.48K   87.36K    2.53Mb   336    Native∓     Native∓
                        36.48K   72.96K    1.27Mb   312
                KU115   24.96K   49.92K    2.95Mb   288    Native∓     Native∓
                        21.60K   43.20K    2.95Mb   192

∓: Equal to the bandwidth provided by the underlying FPGA routing fabric.
†: Two-level system abstraction with two types of LL virtual blocks.
f: Operating frequency of users’ applications.

blocks. This is because the DDR4 controller (implemented in sub-region S-2) utilizes the
reconfigurable resources near the IO column in block U-9. Thus, in order to create identical
physical blocks, a small portion of reconfigurable resources near the central IO column is
excluded from all physical blocks. The DDR4 controller also utilizes the routing resources
in the physical block U-9. To avoid conflict on routing resources, one virtual block is first
mapped into the physical block U-9 and is then relocated into other physical blocks (e.g.,
U-5). By providing a synchronous interface for the communication between physical blocks,
the intra-die and inter-die communication latency/bandwidth provided by this overlay is
the same as that provided by the native FPGA routing fabric, thereby fully utilizing the
on-chip routing fabric. The Service Region contains the standard IP cores to share the
interface of DRAM, PCIe and multiplexing network. The sub-region S-0 implements the
IP core that utilizes the high-speed transceiver to provide the inter-FPGA communication
interface, which is shared between physical blocks by the multiplexers implemented in sub-region S-1. The maximum bandwidth provided by this interface is 90Gb/s and the latency
is 52ns. A DDR4 controller is implemented in the sub-region S-2 and is shared between
physical blocks in a round-robin manner through the AXI Interconnect IP implemented in
the sub-region S-4. This sub-region S-4 also implements another AXI Interconnect IP
to share the PCIe module implemented in the sub-region S-3 among physical blocks in a
round-robin manner.
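The round-robin sharing policy can be illustrated with a small behavioral model. It is a generic round-robin arbiter written for illustration and does not describe the internals of the AXI Interconnect IP; the class name and its interface are hypothetical.

```python
class RoundRobinArbiter:
    """Behavioral model: grant one requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last = num_requesters - 1  # so that requester 0 has the highest priority first

    def grant(self, requests):
        """requests[i] is True if physical block i wants the shared interface."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx  # the granted block gets the lowest priority next time
                return idx
        return None  # nobody is requesting

# Ten physical blocks, as on the XCVU37P overlay.
arbiter = RoundRobinArbiter(10)
all_busy = [arbiter.grant([True] * 10) for _ in range(12)]

# Only blocks 3 and 7 issue requests: they simply alternate.
arbiter2 = RoundRobinArbiter(10)
only_two = [i in (3, 7) for i in range(10)]
alternating = [arbiter2.grant(only_two) for _ in range(4)]
print(all_busy, alternating)
```

Under this policy every requesting block is served once before any block is served twice, so no physical block can monopolize the shared DRAM or PCIe interface.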
The FPGA overlay implemented on the XCKU115 FPGA is shown in Figure 3.15. Compared with that of XCVU37P, the major difference is that the User Region on XCKU115
only provides 8 physical blocks. The capacity of each physical block is presented in Table
3.1. The maximum inter-FPGA communication bandwidth is 36Gb/s and the latency is
40ns. This communication latency is slightly lower than that in XCVU37P. This is because
the height of XCKU115 (in terms of clock regions) is smaller than that of XCVU37P; thus,
the multiplexing network in the sub-region S-1 needs fewer pipeline registers to propagate
the signals.
[Figure 3.15 diagram: layout of the XCKU115 overlay with the die boundary marked. User region: physical blocks 0-7 (U-0 to U-7). Service region: sub-regions S-0 to S-3, with legend entries for the high-speed transceiver, the multiplexers that share the inter-FPGA connection between physical blocks, the DDR4 and PCIe controllers, and the multiplexers that share the DRAM/PCIe interface between physical blocks.]

Figure 3.15: The commercial FPGA XCKU115 from Xilinx is partitioned into regions to
support the two-level system abstraction. User Region that is indexed with U is exposed
to users, while the Service Region that is indexed with S is reserved by the system.
The circuits in the system-reserved regions are pre-implemented and cannot be modified by
users. The mapping results are obtained from Vivado 2020.1.

Due to the requirement of creating identical physical blocks, we found that a non-negligible
amount of FPGA resources is wasted, mainly in the sub-region S-2. Specifically,
the DDR4 controller implemented in S-2 runs at a high internal frequency, thus, its logic
needs to be placed close to the IO column. This leads to the unused resources in the left part
of the sub-region S-2. Moreover, the multiplexer networks implemented in sub-region S-1
and S-4 only utilize the LUTs and DFFs, while the DSPs and BRAMs in these sub-regions
are totally wasted (∼ 10% of the total DSPs/BRAMs). This issue can be largely alleviated
if the logic implemented in Service Region is replaced by dedicated hard IP blocks, such
as a hardened DDR4 controller. We expect this will be realized in future FPGAs as the
function provided by Service Region requires limited reconfigurability and a hardened
memory controller is already provided by some types of FPGAs.

3.5.2

Single-Level System Abstraction

The implementation of the single-level system abstraction on the XCVU37P FPGA is shown
in Figure 3.16. The major differences between the implementation of the single-level system
abstraction and that of the two-level one are: (1) One XCVU37P FPGA can only provide
8 physical blocks due to the additional communication region (C-0 and C-1). (2) The
physical blocks in the middle (U-1 to U-6) have no access to the inter-FPGA network; thus,
the sub-region S-1 only needs to implement pipeline registers to propagate signals to C-1
instead of the multiplexers in the implementation of the two-level abstraction. (3) Pipeline
registers are included for the inter-die communication (between U-2/U-6 and U-3/U-7) to
isolate the timing of the inter-die connection from the intra-block timing. (4) Partition pins
are created for every physical block to assist the local placement. 40K partition pins (half of
them are input pins) are created for each physical block for the intra-die interconnections.
These partition pins are evenly distributed on the boundary of physical blocks (Figure 3.10).
In addition, 13K partition pins (half of them are input pins) are created for the inter-die
interconnections. These partition pins are placed close to the LAGUNA cells to simplify
the creation of this overlay at the cost of an increased compilation time. These partition
pins determine the upper bound of the intra-FPGA communication bandwidth, as reported
in Table 3.1. The latency of both inter-die and intra-die communication is one clock cycle.
[Figure 3.16 diagram: layout of the XCVU37P overlay for the single-level system abstraction with die boundaries marked. User region: physical blocks 0-7 (U-0 to U-7). Communication region: C-0 and C-1. Service region: S-0 high-speed transceiver; S-1 pipeline registers to C-1; S-2 DDR4 controller; S-3 PCIe controller; S-4 multiplexers that share the DRAM/PCIe interface between physical blocks. Legend entries also mark inter-FPGA communication and partition pins.]

Figure 3.16: The commercial FPGA XCVU37P from Xilinx is partitioned into three regions to support the single-level system abstraction. User Region that is indexed with
U is exposed to users, while the Service Region that is indexed with S and the Communication Region that is indexed with C are reserved by the system. The circuits in
the system-reserved regions are pre-implemented and cannot be modified by users. The
partition pins are drawn for illustration only and do not reflect their actual positions.
The mapping results are obtained from Vivado 2020.1.
(5) Additional reconfigurable regions are created on the left and right side of the physical
blocks for propagating the control signals of the asynchronous interface for the intra-FPGA
communication (explained in Figure 4.8). These reconfigurable regions are not drawn in the
figure due to their narrow width, but they reduce the capacity
of the physical blocks, as reflected in Table 3.1.
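The bandwidth upper bound implied by the partition pins can be worked out with a short calculation. The sketch assumes that each partition pin carries one bit per clock cycle and that "K" denotes 1,000; the 200 MHz operating frequency is a hypothetical value chosen only to make the arithmetic concrete.

```python
def pin_bandwidth_bound_gbps(num_pins, freq_mhz):
    """Upper bound assuming every partition pin moves one bit per clock cycle."""
    return num_pins * freq_mhz * 1e6 / 1e9

F_MHZ = 200  # hypothetical operating frequency f of the user application

# Partition pin counts of one XCVU37P physical block (inputs and outputs together).
intra_die = pin_bandwidth_bound_gbps(40_000, F_MHZ)
inter_die = pin_bandwidth_bound_gbps(13_000, F_MHZ)

# Half of the pins are inputs, so each direction gets half of the bound.
intra_die_per_direction = intra_die / 2
print(intra_die, inter_die, intra_die_per_direction)
```

The bound scales linearly with f, which is why Table 3.1 reports it as a pin count multiplied by the operating frequency rather than as a fixed number.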
The implementation on XCKU115 FPGA is shown in Figure 3.17. Only 6 physical blocks
are provided by one FPGA due to the additional communication region. 30K partition
pins (half of them are input pins) are created for each physical block for the intra-die
interconnections. Moreover, 8K partition pins (half of them are input pins) are created for
the inter-die interconnections. The number of partition pins is lower than that in XCVU37P

[Figure 3.17 diagram: layout of the XCKU115 overlay for the single-level system abstraction with the die boundary marked. User region: physical blocks 0-5 (U-0 to U-5). Communication region: C-0 and C-1. Service region: sub-regions S-0 to S-3, with legend entries for the high-speed transceiver, the pipeline registers to C-1, the DDR4 and PCIe controllers, and the multiplexers that share the DRAM/PCIe interface between physical blocks. Legend entries also mark inter-FPGA communication and partition pins.]

Figure 3.17: The commercial FPGA XCKU115 from Xilinx is partitioned into three regions
to support the single-level system abstraction. User Region that is indexed with U is
exposed to users, while the Service Region that is indexed with S and the Communication Region that is indexed with C are reserved by the system. The circuits in the
system-reserved regions are pre-implemented and cannot be modified by users. The partition pins are drawn for illustration only and do not reflect their actual positions. The
mapping results are obtained from Vivado 2020.1.
FPGA because the size of the physical block is smaller than that in XCVU37P (in
terms of the number of resource columns). The capacity of one physical block is reported
in Table 3.1.

3.5.3

Creating Multiple Types of Physical Blocks

For the two-level system abstraction, we provide another implementation that contains two
types of physical blocks on one FPGA device. For the XCVU37P FPGA, this implementation is shown in Figure 3.18. Compared to the implementation in Figure 3.14, one physical
block is further partitioned into two smaller physical blocks (e.g., U-0 is partitioned into
U-0L and U-0R). An additional sub-region S-5 (belonging to the Service Region) is created

[Figure 3.18 diagram: layout of the XCVU37P overlay with two types of physical blocks and die boundaries marked. User region: physical blocks 0-9, each split into a left and a right block (U-0L/U-0R to U-9L/U-9R). Service region: S-0 high-speed transceiver; S-1 multiplexers that share the inter-FPGA connection between physical blocks; S-2 DDR4 controller; S-3 PCIe controller; S-4 multiplexers that share the DRAM/PCIe interface between physical blocks; S-5 multiplexers that share the DRAM/PCIe interface and the inter-FPGA connection between physical blocks.]

Figure 3.18: Two types of physical blocks are created on XCVU37P FPGA when implementing the two-level system abstraction. An additional sub-region S-5 is created to share
the DDR4/PCIe interface and the inter-FPGA interconnection among these smaller physical blocks. The mapping results are obtained from Vivado 2020.1.

to (1) provide the access to the DDR4/PCIe interface for the physical blocks on the left
and (2) provide the access to the inter-FPGA interconnection for the physical blocks on
the right. Although the smaller physical blocks could reduce the resource waste caused
by internal fragmentation, the additional region reduces the amount of FPGA
resources provided by one FPGA device, by about 20% for the scarce BRAM resources as reported in Table 3.1. Moreover, for the inter-FPGA communication, additional
bits need to be reserved in the packet for selecting the physical blocks. This reduces the
maximum inter-FPGA communication bandwidth to 84.4Gb/s. Thus, it is not obvious
that this implementation is better than the original one in Figure 3.14. Section 5.4.3 will
show that this implementation only achieves a negligible improvement in the aggregated
system throughput. This variant is also implemented on XCKU115 FPGA, as shown in
Figure 3.19.

[Figure 3.19 legend: User region: physical blocks 0-7, each split into a left and a right
block (U-0L/U-0R to U-7L/U-7R); the drawing marks the die boundary and block heights of one
or two clock regions. Service region: S-0 high-speed transceiver; S-1 multiplexers that share
the inter-FPGA connection between physical blocks; S-2 DDR4 controller and PCIe controller;
S-3 multiplexers that share the DRAM/PCIe interface between physical blocks; S-4 multiplexers
that share the DRAM/PCIe interface and the inter-FPGA connection between physical blocks.]

Figure 3.19: Two types of physical blocks are created on the XCKU115 FPGA when implementing
the two-level system abstraction. An additional sub-region S-4 is created to share the
DDR4/PCIe interface and the inter-FPGA interconnection among these smaller physical blocks.
The mapping results are obtained from Vivado 2020.1.

3.5.4 Discussion

The total amount of FPGA resources exposed by these different system abstractions is
presented in Table 3.2. In the two-level system abstraction, the service regions occupy about
30% of the total FPGA resources, yet the resource utilization of these service regions is
lower than 40%. Due to the constraint of creating identical physical blocks, the unused FPGA
resources in the service region cannot be allocated to create additional physical blocks. One
possible solution is to provide hardened service regions. For instance, if a hardened DDR4
controller is provided, then the FPGA resources in the service region S-2 (which occupies
15-20% of the entire FPGA resources) can be used to create additional physical blocks to
increase the amount of resources exposed to users.

In the single-level system abstraction, additional communication regions (C-1 and C-2)
further reduce the amount of FPGA resources exposed to users. Similar to the service region,
the resource utilization in the communication region is also low (< 25%). However, due to the
constraint of partial reconfiguration (Section 2.3), the height of these communication
regions cannot be reduced, which leads to the waste of FPGA resources. One possible solution
is to reduce the height of clock regions in the FPGA architecture to decrease the size of the
communication regions.

In the two-level system abstraction that provides two types of LL virtual blocks for one type
of FPGA, the additional multiplexers for sharing the peripheral interfaces slightly reduce
the amount of resources exposed to users. We note that the circuits for sharing peripheral
interfaces do not require a high degree of programmability. Thus, it might be beneficial to
harden these circuits to further increase the amount of resources exposed to users.

Table 3.2: The amount of resources exposed to users.

Abstraction     FPGA    LUTs             DFFs             BRAM             DSPs
Two-Level       VU37P   868.8K (66.6%)   1.74M (66.6%)    46.4Mb (65.4%)   6960 (77.1%)
                KU115   464.6K (70.0%)   929.3K (70.0%)   54.0Mb (71.1%)   4224 (76.5%)
Single-Level    VU37P   668.2K (51.2%)   1.02M (40.0%)    37.1Mb (52.3%)   5376 (59.6%)
                KU115   308.2K (46.5%)   436.3K (32.9%)   35.5Mb (46.8%)   3168 (57.4%)
Two-Level†      VU37P   799.6K (61.3%)   1.60M (61.4%)    38.0Mb (53.6%)   6480 (71.8%)
                KU115   372.5K (56.2%)   745.0K (56.2%)   47.2Mb (62.2%)   3840 (69.6%)

†: Two-level system abstraction with two types of LL virtual blocks.

Chapter 4
Compilation Framework

This chapter describes a new compilation framework that maps applications onto the proposed
system abstraction (Chapter 3). The key design principle is to maximally reuse the existing
FPGA compilation tools in order to (1) minimize the engineering effort of developing this
compilation framework and (2) ensure the compilation quality so as to minimize the
virtualization overhead. Custom tools are developed only for the steps that are not supported
by conventional FPGA compilation tools. The following sections first describe the compilation
framework developed for the two-level system abstraction and then extend it to support the
single-level system abstraction specialized for the homogeneous FPGA cluster, as well as the
three-level system abstraction for application-specific ISAs. In this chapter, the FPGA
compilation tool from Xilinx (Vivado) is used to build a compilation framework for Xilinx
FPGAs; the same strategy can be applied to build a compilation framework for Intel FPGAs.

4.1 Compilation Framework for Two-Level Abstraction

This compilation framework comprises six steps: synthesis, high-level partition, low-level
partition, local routing, relocation, and global place&route, as illustrated in Figure 4.1.
The steps that use custom tools are highlighted. These custom tools are either developed from
scratch or built on the APIs provided by RapidWright [77][78]. The remaining steps reuse the
proprietary FPGA tools (Vivado in our implementation) to achieve a compilation quality
comparable to that of the conventional FPGA compilation flow.

[Figure 4.1 flow: Applications (TensorFlow, OpenCL, ...) -> High-Level Synthesis -> Verilog
RTL -> High-Level Partition (Parser, Partition, Latency-Insensitive Interface Generation,
optional Custom Interface Description) -> Low-Level Partition (Technology Mapping,
Floorplanning Constraint File, Monolithic Placement with the commercial FPGA tool, Placement
Splitting) -> Local Routing (commercial FPGA tool) -> Relocation -> Global Place&Route
(commercial FPGA tool) -> Bitstream. Design check points are passed between the steps.]

Figure 4.1: The compilation framework for the two-level system abstraction. The steps
using custom tools are highlighted in blue.

Step 1: Synthesis. This step reuses existing high-level synthesis tools to convert
applications written in high-level programming languages into Verilog RTL code. Different
high-level synthesis tools can be integrated into this compilation framework (extensibility)
as long as they output Verilog RTL code.

Step 2: High-Level Partition. This step has two sub-steps that map the input RTL code onto
the high-level abstraction. The first sub-step uses a custom tool to partition the RTL code
into a given number of HL virtual blocks with the optimization goal of minimizing the
inter-block communication cost (in terms of the number of inter-block connections). As the
capacity of one HL virtual block can be arbitrarily chosen, this partition step is performed
with no hardware constraints, which simplifies the partition process. The custom tool builds
the dataflow graph (DFG) of the input RTL and uses a simulated annealing algorithm [127] (or
a min-cut algorithm [115]) to partition the application. The partition is performed at the
granularity of Verilog modules, i.e., each node in the DFG is a module. This effectively
prunes the search space with a negligible degradation in partition quality, since the
inter-module communication bandwidth is typically much lower than the intra-module
communication bandwidth. A recursive method is applied in this partition process, as
described in Section 4.1.1.
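The first sub-step can be illustrated with a minimal Python sketch, assuming a module-level
graph whose edge weights count the connections between two Verilog modules. The function
names (cut_size, anneal_bipartition), the move set (flipping one module at a time), and the
cooling schedule are illustrative assumptions, not the custom tool's actual implementation.
The sketch only shows how simulated annealing can search for a two-way split that minimizes
the number of inter-block connections when no capacity constraint is imposed.

```python
import math
import random


def cut_size(edges, side):
    """Number of connections whose two endpoint modules sit in different blocks."""
    return sum(w for (u, v), w in edges.items() if side[u] != side[v])


def anneal_bipartition(modules, edges, steps=5000, t0=5.0, cooling=0.999, seed=0):
    """Split modules into two HL virtual blocks (0 and 1) with a small cut.

    modules: list of module names (nodes of the module-level graph).
    edges:   dict mapping (module_a, module_b) to the number of connections.
    """
    rng = random.Random(seed)
    side = {m: i % 2 for i, m in enumerate(modules)}  # arbitrary initial split
    cost = cut_size(edges, side)
    best, best_cost = dict(side), cost
    temp = t0
    for _ in range(steps):
        m = rng.choice(modules)
        side[m] ^= 1  # tentative move: flip one module to the other block
        if len(set(side.values())) < 2:
            side[m] ^= 1  # reject moves that would leave one block empty
            continue
        new_cost = cut_size(edges, side)
        delta = new_cost - cost
        # Always accept improvements; accept degradations with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(side), cost
        else:
            side[m] ^= 1  # undo the rejected move
        temp = max(temp * cooling, 1e-3)
    return best, best_cost


# Hypothetical design: two tightly coupled module clusters joined by one wire.
modules = ["a0", "a1", "a2", "b0", "b1", "b2"]
edges = {
    ("a0", "a1"): 32, ("a1", "a2"): 32, ("a0", "a2"): 32,
    ("b0", "b1"): 32, ("b1", "b2"): 32, ("b0", "b2"): 32,
    ("a2", "b0"): 1,
}
assignment, connections = anneal_bipartition(modules, edges)
```

Because the sketch works on whole modules, the search space has only one node per module,
which mirrors the pruning argument made above.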

[Figure 4.2 labels: FIFOs to and from other HL virtual blocks (Data, Full, Empty, Write
Enable); a NOR gate driving Clock Enable; user logic with a Datapath (2 cycles), an Adder
(1 cycle), and a DFF; the Valid signal; Standard DRAM Interface; Latency-Insensitive
Interface.]

Figure 4.2: A conceptual diagram illustrating the latency-insensitive interface generated for
one HL virtual block.

The second sub-step uses a custom tool to generate the latency-insensitive interface for each
HL virtual block obtained from the partition process. Rather than transferring the output
signals of user logic in a cycle-by-cycle manner, the generated interface transfers only the
valid output data. This is achieved by leveraging the observation that most FPGA applications
use standard interfaces (e.g., the AXI interface [75]) to fetch input data from peripherals
(e.g., DRAM), and these interfaces contain a data valid signal. The custom tool generates the
necessary logic to propagate this valid signal into the latency-insensitive interface, so
that the interface buffers only the valid output data, as illustrated in Figure 4.2. Users
can also provide a description of a custom interface used in the application to leverage this
optimization. If the modules in one HL virtual block do not utilize such interfaces, then the
generated latency-insensitive interface transfers output signals in a cycle-by-cycle manner
to ensure correctness. The output signals that share the same valid signal are combined and
buffered by the same FIFO to minimize the number of required FIFOs and the overhead of
control logic.
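The FIFO-sharing rule at the end of this sub-step can be sketched as a simple grouping. The
data layout below (a dictionary from output signal name to its bit width and its valid
signal) and the function name group_outputs_by_valid are assumptions made for illustration;
a valid signal of None stands for outputs that have no associated valid signal and therefore
fall back to cycle-by-cycle transfer.

```python
from collections import defaultdict


def group_outputs_by_valid(outputs):
    """Assign output signals that share a valid signal to one FIFO.

    outputs: dict mapping signal name -> (width_in_bits, valid_signal_or_None).
    Returns one FIFO description per distinct valid signal.
    """
    groups = defaultdict(list)
    for name, (width, valid) in outputs.items():
        groups[valid].append((name, width))

    fifos = []
    for valid, signals in groups.items():
        fifos.append({
            "valid": valid,                        # None: transfer every cycle
            "signals": [name for name, _ in signals],
            "width": sum(w for _, w in signals),   # signals are concatenated
        })
    return fifos


# Hypothetical outputs of one HL virtual block.
example_outputs = {
    "rdata": (512, "rvalid"),
    "rlast": (1, "rvalid"),
    "status": (8, None),
}
example_fifos = group_outputs_by_valid(example_outputs)
```

In this example two FIFOs suffice for three output signals, since rdata and rlast are
qualified by the same valid signal.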
The latency-insensitive interface also needs to halt the execution of user logic when the
corresponding input FIFO is empty or the output FIFO is full. The key is to keep the internal
state of user logic unmodified while the execution is halted, such as the contents of on-chip
memories, the result registers in an accumulator, and the state registers in FSMs. The custom
tool identifies the logic primitives that store internal state, e.g., DFFs within a feedback
loop, and generates the control signal for their clock enable ports², as illustrated in
Figure 4.2. When execution is halted, the state of these elements is not modified because the
clock is disabled. The custom tool also generates control signals for the write enable ports
of on-chip memories to guarantee that the contents of these memories are not modified.
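The halting behaviour can be checked with a small cycle-level model. This is a sketch under
stated assumptions: the user logic is a toy accumulator, the clock enable is modelled as
NOR(input FIFO empty, output FIFO full) following the labels in Figure 4.2, and the names
Accumulator and run, as well as the way upstream and downstream traffic is generated, are
hypothetical. The point is that the results do not depend on when stalls occur, because the
state register only updates while the clock enable is asserted.

```python
from collections import deque


class Accumulator:
    """Toy user logic: a running sum held in a state register."""

    def __init__(self):
        self.total = 0

    def clock(self, data):
        # Called only in cycles where the clock enable is asserted.
        self.total += data
        return self.total


def run(inputs_by_cycle, out_capacity, drain_every):
    """Simulate one HL virtual block behind a latency-insensitive interface.

    inputs_by_cycle: one entry per cycle; None means no valid data arrives.
    out_capacity:    depth of the output FIFO.
    drain_every:     the downstream block pops one word every this many cycles.
    """
    in_fifo, out_fifo = deque(), deque()
    logic, received = Accumulator(), []
    pending = list(inputs_by_cycle)
    cycle = 0
    while pending or in_fifo or out_fifo:
        if pending:
            item = pending.pop(0)
            if item is not None:          # only valid data is buffered
                in_fifo.append(item)
        if out_fifo and cycle % drain_every == 0:
            received.append(out_fifo.popleft())
        empty = not in_fifo
        full = len(out_fifo) >= out_capacity
        clock_enable = not (empty or full)   # NOR(empty, full)
        if clock_enable:
            out_fifo.append(logic.clock(in_fifo.popleft()))
        # Otherwise the accumulator is halted and keeps its state.
        cycle += 1
    return received
```

Running the model with bursty input and a slow consumer, and again with no stalls at all,
produces the same sequence of partial sums.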
Step 3: Low-Level Partition. This step has three sub-steps that place the user logic and the
latency-insensitive interface of one HL virtual block into an array of LL virtual blocks.
Instead of first partitioning one HL virtual block and then placing each partition into one
LL virtual block, we choose to reuse the commercial place&route tool to monolithically place
one HL virtual block onto a given region on the physical FPGA and then split the placement
result to generate the placement of each LL virtual block. The synchronous interface between
LL virtual blocks enables this flow. The size of the pre-defined region is determined by the
amount of resources required by the user logic in one HL virtual block. This flow has three
benefits compared to the alternative flow: (1) the placement of all LL virtual blocks is
jointly optimized in the monolithic mapping process, (2) the highly optimized commercial FPGA
place&route tool ensures the quality of the placement result, and (3) this flow can better
utilize the direct interconnections between adjacent physical blocks. The placement processes
for all possible regions can be fully parallelized to minimize the compilation time
(Figure 4.3). Moreover, since the monolithic placement process is the same as the placement
process in the conventional FPGA compilation flow, the techniques proposed in prior works
[51][136][138] that improve the placement quality and reduce the compilation time could also
be applied to this step.

² This process is performed at the netlist level in the actual implementation, but it is
drawn as performed before technology mapping in Figure 4.1 for simplicity.

[Figure 4.3 flow: a high-level virtual block (user logic and asynchronous interface) ->
floorplanning based on resource usage, which allocates a region on the physical FPGA for
monolithic placement -> parallel out-of-context placement of the candidate regions -> split
placement, which yields placed low-level virtual blocks -> relocation to other physical
blocks.]

Figure 4.3: A conceptual diagram illustrating the process of mapping one high-level virtual
block onto physical FPGAs. The local routing step is not drawn in the figure for simplicity.

The first sub-step uses a custom tool to estimate the number of LL virtual blocks required by
one HL virtual block. It then generates the Vivado constraint file that specifies a region on
the physical FPGA comprising the given number of LL virtual blocks, as illustrated in
Figure 4.3. In the second sub-step, the commercial FPGA place&route tool maps the input HL
virtual block into the defined region using the out-of-context flow [142]. The third sub-step
then uses a custom tool built on the APIs provided by RapidWright to split the monolithic
placement result. The key APIs used in this custom tool are (1) Cell.getSite().getName(), to
obtain the placement result of one primitive and determine which LL virtual block it belongs
to, and (2) Design.createAndPlaceCell(), to create and place one logic primitive in the
corresponding LL virtual block. This sub-step also generates the partition pins for the wires
that cross block boundaries, based on the placement results obtained from the Monolithic
Placement step. This process has a low time complexity, as it only needs to read the
placement result of each logic primitive (e.g., LUT6) and assign it to the corresponding LL
virtual block. Compared with the monolithic placement process, the runtime of this sub-step
is negligible.
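The first and third sub-steps can be sketched in Python as follows. This is not RapidWright
code: the sketch works on site names of the form returned by Cell.getSite().getName() (for
example SLICE_X12Y70), and everything else is an assumption made for illustration, including
the function names, the per-block capacities, and the description of each LL virtual block as
one inclusive rectangle of site coordinates. Real devices use a separate coordinate grid for
each site type, which the sketch ignores by using SLICE sites only.

```python
import math
import re

# Site names such as "SLICE_X12Y70": a type prefix followed by X/Y coordinates.
SITE_RE = re.compile(r"^[A-Z0-9]+_X(\d+)Y(\d+)$")


def estimate_ll_blocks(usage, block_capacity):
    """Smallest number of LL virtual blocks that covers every resource type."""
    return max(math.ceil(usage[r] / block_capacity[r]) for r in usage)


def split_placement(placement, block_regions):
    """Assign each placed primitive to the LL virtual block containing its site.

    placement:     dict mapping cell name -> site name.
    block_regions: dict mapping block id -> (x_min, y_min, x_max, y_max), inclusive.
    A single pass over the cells, so the cost is linear in the number of primitives
    for a fixed number of blocks.
    """
    per_block = {block: {} for block in block_regions}
    for cell, site in placement.items():
        match = SITE_RE.match(site)
        if match is None:
            raise ValueError(f"unrecognized site name: {site}")
        x, y = int(match.group(1)), int(match.group(2))
        for block, (x0, y0, x1, y1) in block_regions.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                per_block[block][cell] = site
                break
        else:
            raise ValueError(f"{cell} at {site} lies outside every block")
    return per_block


# Hypothetical numbers: resource usage of one HL virtual block and the
# capacity of one LL virtual block.
needed = estimate_ll_blocks(
    {"LUT": 50000, "BRAM": 120, "DSP": 300},
    {"LUT": 43000, "BRAM": 72, "DSP": 348},
)

# Two vertically stacked LL virtual blocks and a three-cell placement.
regions = {0: (0, 0, 9, 59), 1: (0, 60, 9, 119)}
cells = {"lut_a": "SLICE_X3Y10", "lut_b": "SLICE_X7Y75", "ff_c": "SLICE_X0Y60"}
parts = split_placement(cells, regions)
```

Here the block count is set by the most demanding resource type (two blocks, driven by both
LUTs and BRAM in this made-up example).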
Step 4: Local Routing. This step reuses the commercial FPGA place&route tool to perform the
local routing for each LL virtual block. The local routing of all LL virtual blocks can be
performed in parallel to reduce the compilation time.

Step 5: Relocation. This step uses a custom tool that leverages the APIs provided by
RapidWright to relocate one mapped LL virtual block to other feasible physical blocks without
recompilation (Figure 4.3). The key APIs used in this custom tool are (1) Module.setAnchor(),
to generate the placement and routing relative to the given anchor (a logic primitive such as
a BRAM), and (2) Module.place(), to generate a new placement and routing result for a given
new anchor.

Step 6: Global Place&Route. This step reuses the commercial FPGA tools to integrate the
individually mapped components into a complete design and generate the partial
reconfiguration bitstreams that support dynamic runtime management. This process is not
supported by the Vivado GUI, so we develop a Tcl script to automate it.
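The idea behind the relocation in Step 5 can be sketched without RapidWright. The model below
is an assumption made for illustration: the fabric is a dictionary from tile coordinates to a
site type, a mapped block is a dictionary from cell name to tile coordinates, and the
function name relocate is hypothetical. The placement is expressed relative to an anchor cell
and re-instantiated at a new anchor; the move is feasible only if every translated cell lands
on a tile of the same type. Routing, which the actual flow relocates as well, is not
modelled.

```python
def relocate(placement, anchor_cell, new_anchor, fabric):
    """Translate a placed block so that anchor_cell lands on new_anchor.

    placement:   dict mapping cell name -> (col, row) tile coordinates.
    anchor_cell: name of the cell used as the anchor (for example a BRAM).
    new_anchor:  (col, row) target tile for the anchor.
    fabric:      dict mapping (col, row) -> site type of that tile.
    Returns the translated placement, or None if the target is not feasible.
    """
    ax, ay = placement[anchor_cell]
    nx, ny = new_anchor
    moved = {}
    for cell, (x, y) in placement.items():
        target = (nx + (x - ax), ny + (y - ay))   # keep the offset to the anchor
        if fabric.get(target) != fabric[(x, y)]:
            return None                            # missing tile or different type
        moved[cell] = target
    return moved


# Hypothetical fabric: two SLICE columns and one BRAM column, eight rows high.
fabric = {(c, r): ("RAMB" if c == 2 else "SLICE") for c in range(3) for r in range(8)}
block = {"bram0": (2, 1), "lut0": (0, 1), "lut1": (1, 2)}

shifted_up = relocate(block, "bram0", (2, 5), fabric)      # same columns, four rows up
wrong_column = relocate(block, "bram0", (1, 5), fabric)    # BRAM would land on a SLICE
off_fabric = relocate(block, "bram0", (2, 7), fabric)      # lut1 would leave the fabric
```

Only the first target is feasible, which corresponds to moving the block to another physical
block with an identical resource layout.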

[Figure 4.4 labels: the user logic of one application is first mapped to a single high-level
virtual block and then split over N iterations, with K pairs generated per iteration; the
resulting sets of HL virtual blocks can be used to deploy this application onto two FPGAs or
onto three FPGAs.]

Figure 4.4: One application is recursively partitioned into multiple HL virtual blocks.

4.1.1 Recursive Partition Process

We adopt a recursive partition method to map one application into HL virtual blocks. As
illustrated in Figure 4.4, one application is first mapped into a single HL virtual block,
which is then partitioned into two HL virtual blocks. This process is recursively performed
N times and generates $2^{N+1} - 1$ HL virtual blocks in total. All these HL virtual blocks
are mapped onto LL virtual blocks using the flow illustrated in Figure 4.3 to support various
runtime deployments, from deploying the application onto a single FPGA to spreading it over
up to $2^{N}$ FPGAs, as illustrated in Figure 4.4. Each round can generate K different
partition results to further increase the runtime deployment flexibility. Overall, the number
of HL virtual blocks generated by this step is

\[
\#\mathrm{Blocks} = 1 + K \sum_{i=1}^{N} 2^{i} = 2K\,(2^{N} - 1) + 1 = O(K\,2^{N})
\]
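As a quick check of the block count above, the following Python sketch sums the blocks round
by round and compares the result with the closed form. The function names are hypothetical;
the arithmetic follows the expression directly: one unpartitioned block plus K partition
results per round, where round i contributes 2^i blocks per result.

```python
def count_hl_blocks(n_rounds, k_results):
    """1 unpartitioned block plus K * 2^i blocks for each round i = 1..N."""
    count = 1
    for i in range(1, n_rounds + 1):
        count += k_results * 2 ** i
    return count


def closed_form(n_rounds, k_results):
    """Closed form of the same sum: 2K(2^N - 1) + 1."""
    return 2 * k_results * (2 ** n_rounds - 1) + 1


# The two expressions agree for every small N and K.
agree = all(
    count_hl_blocks(n, k) == closed_form(n, k)
    for n in range(0, 8)
    for k in range(1, 5)
)
```

With K = 1 the count reduces to 2^(N+1) - 1, for example 3 blocks for N = 1 and 15 blocks
for N = 3.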

We need to judiciously determine the values of the parameters N and K to balance the
compilation cost and the runtime deployment flexibility. For a specific application, it is
obvious that a larger K leads to better runtime performance. Nevertheless, this is not the
case when considering the entire system. Specifically, when the system has abundant FPGA
resources, the runtime system can find an appropriate resource allocation to deploy one
application in most cases, even if this application only has one mapping result (K = 1), as
illustrated in Figure 4.5. Conversely, when the system has low resource availability (high
resource contention), the runtime system might not be able to find a feasible resource
allocation no matter how many mapping results are generated for one application, as also
illustrated in Figure 4.5. Thus, increasing K leads to a non-negligible runtime performance
improvement only in limited scenarios. Based on this analysis, we choose K = 1 in our
compilation framework.

For the parameter N, a mapping result that can deploy one application onto multiple FPGAs
(N ≥ 1) is needed to alleviate the resource fragmentation caused by the boundaries of
physical FPGAs. Moreover, the value of N is also related to the size of the application: a
large application needs a large N to generate HL virtual blocks that are small enough. Based
on our design space exploration (Section 5.4.1), for applications that can fit into one FPGA
device (the majority of existing FPGA applications), N = 1 is sufficient to achieve a high
aggregated system performance. Thus, we choose N = 1 and K = 1 in our compilation framework.

[Figure 4.5 labels: K=1 gives one mapping result with two HL virtual blocks; K=2 gives two
mapping results, each with two HL virtual blocks. Along the resource availability axis: at
high availability both have the same runtime performance (both can be deployed); in between,
K=2 is better; at low availability both have the same runtime performance (neither can be
deployed). Some physical blocks are occupied by other applications.]

Figure 4.5: A conceptual diagram illustrating that increasing K leads to a non-negligible
runtime performance improvement only in limited scenarios.

[Figure 4.6 flow: Applications (TensorFlow, OpenCL, ...) -> High-Level Synthesis -> Parser ->
Technology Mapping -> Netlist -> Partition and Latency-Insensitive Interface Generation ->
Netlist -> Local Place&Route (commercial FPGA tool) -> Design Check Point -> Relocation ->
Design Check Point -> Global Place&Route (commercial FPGA tool) -> Bitstream.]

Figure 4.6: The compilation framework for the single-level system abstraction. The steps
using custom tools are highlighted in blue.

4.2 Compilation Framework for Single-Level Abstraction

The compilation framework developed for the single-level system abstraction is shown in
Figure 4.6. Compared with the one for the two-level system abstraction (Figure 4.1), this
compilation framework merges the two partition steps (high-level and low-level partition)
into one step that maps applications into virtual blocks. As the virtual blocks in the
single-level system abstraction have fixed capacity and resource constraints, the goal of
this partition step is to minimize the number of inter-block connections under the given
resource constraints. Thus, we choose to perform this partition step at the netlist level, as
this level provides an accurate estimate of the low-level resource usage (e.g., the number of
LUTs and BRAMs), which is difficult to obtain at the level of control data-flow graphs or
data-flow graphs. A simulated annealing algorithm is applied for this partition step.

This partition step also needs to optimize the location of partition pins to ensure the
quality of the local placement. Partition pins are used as virtual IO pins in the
out-of-context flow [142] to guide the placement process (Figure 4.7a). Compared with the
monolithic placement step in Figure 4.1, the local place&route step in this compilation
framework has a smaller placement region and more partition pins. Thus, the quality of the
local placement result is more sensitive to the location of the partition pins than that of
the monolithic placement step, as illustrated in Figure 4.7a. To ensure the quality of the
local placement, an iterative partition process is applied to optimize the position of
partition pins, as illustrated in Figure 4.7b. The partition tool places logic primitives
into the given blocks with the goal of minimizing the total wire length, and a partition
result is then generated from the placement result. More specifically, user logic is
partitioned into virtual blocks in the first iteration. Each virtual block is then divided
into two sub-blocks, and the user logic assigned to this virtual block in the first iteration
is partitioned into these two sub-blocks. In each following iteration, every sub-block is
further divided into two sub-blocks, and the user logic assigned to it in the previous
iteration is partitioned into the newly generated sub-blocks. This fine-grained partition
result is then used to determine the location of partition pins. We use a parameter n to
control the number of iterations; a larger n leads to a better placement of partition pins.
In Section 4.4.2, a design space exploration is performed to determine the value of n.

[Figure 4.7 panels: (a) Monolithic Placement for Two-Level Abstraction compared with Local
Placement for Single-Level Abstraction, each drawn with two alternative sets of partition-pin
positions (p1, p2, ...), logic primitives, and interconnections; (b) partition results after
Iteration 1, Iteration 2, and Iteration 3, with partition pins, logic primitives, and routed
interconnections.]

Figure 4.7: (a) A conceptual diagram illustrating that the quality of the local placement
step for the single-level system abstraction is more sensitive to the position of partition
pins than that of the monolithic placement step in the two-level system abstraction. This is
mainly because the local placement step has a smaller placement region (one physical block)
and more partition pins. (b) An iterative partition method is applied to obtain a
fine-grained partition result when mapping user logic into virtual blocks. The fine-grained
partition result is leveraged to determine the position of partition pins. In the drawn
example, the partition result obtained from the third iteration is used to determine the
position of partition pins.

The latency-insensitive interface generation step is slightly different from that described
in Section 4.1, mainly because of the different implementation of the intra-FPGA
communication interface, which is illustrated in Figure 4.8. Specifically, once the
inter-FPGA connection is generated for one specific inter-block communication interface, the
implementation of the remaining communication interfaces (inter-FPGA or intra-FPGA) is fully
determined (Figure 4.8), since only two physical blocks in one FPGA device provide access to
the inter-FPGA network. Thus, the control logic for the intra-FPGA communication interface
can be included in the implementation of the inter-FPGA communication interface, and
additional paths are generated to propagate the control signals (Figure 4.8). With this
method, the communication region for intra-FPGA communication does not need to reserve
resources for control logic, which minimizes the amount of resources reserved by the system.

[Figure 4.8 labels: Virtual Block #0, #1, #2, #3, ..., #N, #(N+1); Implementation for
inter-FPGA connection (Input Buffers, Control Logic for block #1 to #N, Output Buffers);
Implementation for intra-FPGA communication; Path for propagating control signals and output
buffer's status.]

Figure 4.8: When the inter-FPGA connection is implemented for one interface, the
implementation of the remaining interfaces is determined by the number of physical blocks
provided by one FPGA. In the drawn example, one FPGA provides N physical blocks. The control
logic for the intermediate blocks can then be merged and implemented in the inter-FPGA
connection. This figure only draws the implementation for one dataflow direction (from top to
bottom) for simplicity.

4.3 Compilation Framework for Application-Specific ISA

Two additional steps are developed for the three-level system abstraction (Figure 3.12): a
decomposing step that maps applications onto the top-level abstraction layer, which comprises
a pool of soft blocks (added before the partition step in Figure 4.1), and a partition step
that maps applications from the top-level abstraction layer to the high-level abstraction
layer (replacing the partition step in Figure 4.1).

4.3.1 Decomposing Step

A given AS ISA-based accelerator is decomposed onto the top-level abstraction layer by
extracting all fine-grained parallel patterns (data parallelism or pipeline parallelism).
This decomposing step is performed at the RTL level, as this allows us to provide an
extensible framework that supports various high-level programming languages/frameworks
[102][92][24][30][26][131]. To decompose a monolithic AS ISA-based accelerator, we first
split the control and data paths at the top level of the design and map them into two
separate soft blocks (Figure 4.9a). This is feasible since AS ISA-based accelerators are
FPGA-based soft processors with well-separated control and data paths. Explicitly separating
the control and data paths enables the optimization technique used for improving the runtime
performance (Section 4.3.2). We then recursively decompose the soft block that contains the
data path while keeping the soft block with the control path unchanged. The soft block with
the data path can be decomposed in either a top-down flow or a bottom-up flow. In the
top-down flow, one soft block is decomposed into multiple child blocks based on one of the
two primitive parallel patterns (Figure 4.9b). This decomposing process is recursively
applied to each newly generated soft block until it contains only a basic module (a Verilog
module that does not instantiate other Verilog modules). Alternatively, in the bottom-up
flow, we first extract all basic modules contained in the data path and assign each of them
to a leaf soft block (Figure 4.9c). We then identify a cluster of soft blocks that are
connected in one of the two primitive patterns and create a parent soft block for them. This
cluster is then replaced by the created soft block, and this process is recursively performed
until no soft blocks can be clustered.

[Figure 4.9 panels: (a) an AS ISA-based accelerator is separated into a control-path soft
block and a data-path soft block connected through a latency-insensitive interface; (b)
top-down flow: decompose the data path, then decompose the generated child blocks; (c)
bottom-up flow: create leaf blocks from the RTL design, then create parent blocks. Legend:
RTL design, soft block, child blocks connected in data parallelism, child blocks connected in
pipeline parallelism.]

Figure 4.9: A conceptual diagram illustrating the decomposing flow, where (a) the control and
data paths in one AS ISA-based accelerator design are first separated into two soft blocks,
and the soft block that contains the data path can be decomposed in either (b) a top-down
flow or (c) a bottom-up flow.

While experienced system designers might be able to manually decompose small AS ISA-based
accelerators following the aforementioned process by directly examining the source code, this
decomposing process becomes more difficult and time-consuming for large and complicated
accelerators. Therefore, we develop a software tool to automate this decomposing process
using the bottom-up flow, which is easier to implement. Also, as it is hard to automatically
identify the control path from the RTL source code, we need the system designer's assistance
to mark the control path by providing the corresponding RTL module name to the automation
tool. We expect the required effort to be small, as these modules can be easily identified at
the top level. For HLS-generated RTL code that might not be human-readable, this marking
process can be performed at the level of HLS code. Specifically, system designers separate
the HLS code for the control and data paths and synthesize them separately to obtain the RTL
module name for the control path. The decomposing tool has the following five steps:
1. Build Block Graph: This step parses the input RTL design to extract all basic modules and
then identifies the basic modules that belong to the data path. Each of these basic modules
is assigned to one soft block. The inter-block connections are built based on the
interconnections between the corresponding basic modules.
2. Extract Intra-Block Data Parallelism: This step extracts the fine-grained data parallelism
inside a soft block. The data parallelism can be identified by performing equivalence
checking on the logic within a soft block [45][93][126]. A group of child blocks is created
for a soft block if it has data parallelism (Figure 4.10a).
3. Identify Inter-Block Data Parallelism: This step checks whether two input blocks of one
soft block have data parallelism (Figure 4.10b). Three cases are considered: (1) the two
input blocks are identical, in which case a parent block is created for these two soft
blocks; (2) one input block has child blocks connected in data parallelism and the other
input block is the same as these child blocks, in which case these soft blocks are grouped
into a single sub-tree; and (3) both input blocks have child blocks connected in data
parallelism and these child blocks are identical, in which case these child blocks are
grouped into a single sub-tree. This step iterates through all soft blocks and terminates
when no such pattern is identified.
4. Identify Pipeline Parallelism: This step checks whether the child blocks of two soft
blocks are connected in pipeline parallelism. Specifically, if both blocks have child blocks
that are connected in data parallelism and the numbers of child blocks are the same, then
these child blocks are grouped into a two-level sub-tree, where the top level is data
parallelism and the bottom level is pipeline parallelism (Figure 4.10c). This step also
iterates through all soft blocks and terminates when no such pattern is identified.
5. Iteration: Steps 3 and 4 are repeated to identify all parallel patterns; the process
terminates when no soft block can be merged.

[Figure 4.10 panels: (a) a leaf soft block computing a + b is split into child blocks
computing a0 + b0 through an + bn; (b) block graphs and the resulting block trees for three
examples, in which soft blocks #1 and #2, soft blocks #1 to #3, and soft blocks #1 to #4 are
identical, respectively; (c) soft blocks #1 and #2 are identical and soft blocks #3 and #4
are identical, and the block tree lists them in the order #1, #3, #2, #4. Legend: RTL design,
soft block, child blocks connected in data parallelism, child blocks connected in pipeline
parallelism.]

Figure 4.10: Conceptual diagrams illustrating (a) the step of extracting the data parallelism
within a leaf soft block, (b) the step of identifying inter-block data parallelism, and (c)
the step of identifying pipeline parallelism.
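The three cases of the inter-block data parallelism step can be sketched in Python as
follows. The SoftBlock class, the function merge_identical_inputs, and the use of a signature
string to decide whether two blocks are identical are all assumptions made for illustration;
the text does not specify how the tool compares blocks, and a real implementation would need
an actual equivalence check rather than a string comparison.

```python
from dataclasses import dataclass, field


@dataclass
class SoftBlock:
    name: str
    kind: str        # "leaf", "data" (data parallelism) or "pipeline"
    signature: str   # stand-in for an equivalence check between blocks
    children: list = field(default_factory=list)


def merge_identical_inputs(input_blocks):
    """Group identical input blocks of one soft block under a data-parallel parent.

    Case 1: identical leaf blocks get a new parent.
    Cases 2 and 3: existing data-parallel parents are flattened, so that all
    identical blocks end up as children of a single sub-tree.
    """
    by_signature = {}
    for block in input_blocks:
        by_signature.setdefault(block.signature, []).append(block)

    merged = []
    for signature, group in by_signature.items():
        if len(group) == 1:
            merged.append(group[0])       # nothing to merge for this signature
            continue
        children = []
        for block in group:
            children.extend(block.children if block.kind == "data" else [block])
        merged.append(SoftBlock(f"dp({signature})", "data", signature, children))
    return merged


# Case 1: two identical multipliers and one adder feed the same soft block.
mul0 = SoftBlock("mul0", "leaf", "mul")
mul1 = SoftBlock("mul1", "leaf", "mul")
add0 = SoftBlock("add0", "leaf", "add")
first_pass = merge_identical_inputs([mul0, mul1, add0])

# Case 2: a data-parallel parent meets another block identical to its children.
mul2 = SoftBlock("mul2", "leaf", "mul")
second_pass = merge_identical_inputs([first_pass[0], mul2])
```

In the tool described above this grouping is repeated over all soft blocks, alternating with
the pipeline parallelism step, until no further merge is possible.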

4.3.2

Partition Step

Similar to the partition step in the compilation framework for the two-level system abstraction (Figure 4.1), this partition step uses a recursive partition process. Specifically, the
top-level soft block is mapped into one HL virtual block, which is then partitioned into two HL
virtual blocks. The newly generated HL virtual blocks are recursively partitioned in the same way. Unlike
the partition step in the compilation framework for the two-level system abstraction,
this step leverages the extracted parallel patterns to simplify the partitioning. Specifically,
if the child blocks of the soft block mapped to one HL virtual block are connected in
pipeline parallelism, the tool examines all inter-block connections and cuts at the one
with the minimal communication bandwidth to divide these child blocks into two clusters.
Alternatively, if the child blocks are connected in data parallelism, they
are evenly grouped into two clusters. Two parent soft blocks are then created for these
two clusters and are mapped into two HL virtual blocks. This recursive partition process
is also controlled by the parameters N and K. As discussed in Section 4.1.1, we choose
N = 1 and K = 1 for this process.
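The pattern-guided bisection described above can be sketched as follows. This is a minimal illustration, not the dissertation's tool: the `SoftBlock` class, the `link_bw` list and the `max_depth` stopping rule are hypothetical names introduced here (the actual recursion is controlled by the parameters N and K from Section 4.1.1, which this sketch does not model).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SoftBlock:
    """Hypothetical stand-in for a soft block in the block tree."""
    name: str
    pattern: str = "leaf"  # "pipeline", "data", or "leaf"
    children: List["SoftBlock"] = field(default_factory=list)
    # Pipeline parents only: link_bw[i] is the bandwidth between
    # children[i] and children[i + 1].
    link_bw: List[float] = field(default_factory=list)


def bisect(block: SoftBlock) -> Tuple[SoftBlock, SoftBlock]:
    """Split the children of one soft block into two clusters."""
    kids = block.children
    if block.pattern == "pipeline":
        # Cut the pipeline at the connection with the least bandwidth.
        weakest = min(range(len(block.link_bw)), key=block.link_bw.__getitem__)
        cut = weakest + 1
    elif block.pattern == "data":
        # Data-parallel children are grouped evenly.
        cut = len(kids) // 2
    else:
        raise ValueError("leaf soft blocks cannot be bisected")
    # Two new parent soft blocks, one per cluster.
    left = SoftBlock(block.name + ".0", block.pattern,
                     kids[:cut], block.link_bw[:cut - 1])
    right = SoftBlock(block.name + ".1", block.pattern,
                      kids[cut:], block.link_bw[cut:])
    return left, right


def partition(block: SoftBlock, max_depth: int) -> List[SoftBlock]:
    """Recursively bisect; each returned block maps to one HL virtual block."""
    if max_depth == 0 or len(block.children) < 2:
        return [block]
    left, right = bisect(block)
    return partition(left, max_depth - 1) + partition(right, max_depth - 1)
```

For a four-stage pipeline whose middle connection carries the least traffic, `bisect` places the first two stages in one cluster and the last two in the other, so only the low-bandwidth connection crosses the HL virtual block boundary.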
Partition Data Path Only. The above recursive partition method is a generic solution
that partitions the entire accelerator into multiple HL virtual blocks, which will be deployed
onto multiple FPGAs at runtime. Nevertheless, the limited inter-FPGA communication
bandwidth and the long communication latency might degrade the performance. Leveraging
the decomposition results, we propose a technique that only partitions the data path so that
the inter-FPGA communication can be effectively overlapped with computation. This technique is
applied to the AS ISA-based accelerators that have data parallelism in the root soft block
of the data path. We expect this to be a common case for AS ISA-based accelerators, as most of
them implement data processors to fully exploit the abundant spatial parallelism in FPGAs.
[Figure 4.11 graphic omitted. Panel (a) shows two scaled-down accelerators, each with a control path, buffers, SIMD units and a result vector (elements V0 to V3, with zeros in the positions computed by the other accelerator), joined by synchronization modules over an inter-FPGA connection. Panel (b) shows the synchronization module between the control path and the DRAM interface: comparators, address and data buffers, an index register, and a FIFO for data from the other FPGA.]

Figure 4.11: (a) A conceptual diagram illustrates the technique of scaling down one AS
ISA-based accelerator. Specifically, one AS ISA-based accelerator is split into two smaller
ones. Each one has a complete control path and only computes part of the computation
results. We provide a template module for inter-FPGA synchronization (highlighted in
blue). (b) The key building blocks of this synchronization module are drawn in the figure.

Instead of partitioning an AS ISA-based accelerator into multiple HL virtual blocks,
we propose to scale down one accelerator into multiple smaller accelerators and map these
accelerators into HL virtual blocks. As illustrated in Figure 4.11a, scaling down one accelerator
is realized by (1) duplicating the accelerator design and (2) reducing the number of
SIMD units to obtain a smaller accelerator design, which only generates part of the computation results. As the control path (e.g., instruction decoder) is not modified, the original
software programs can still run on these small accelerators. The partial computation results
generated by these smaller accelerators need to be combined, and the execution of these
accelerators needs to be synchronized. We provide a parameterized template module and
reuse the instructions for reading/writing on-board DRAM to perform inter-FPGA communication for result combination and synchronization. As illustrated in Figure 4.11b, this
template module monitors the DRAM interface for reading/writing data. If a data entry
is written into a pre-defined address (e.g., an out-of-range address), this module
sends this data entry to the corresponding accelerator through the inter-FPGA network. If
the accelerator reads a pre-defined address, this module sends a response to the accelerator only after it has received data from another accelerator, which realizes a barrier synchronization
(assuming the accelerator implements an in-order processor). This module sets a flag when
identifying this special read request. When this flag is set, this module combines the
received data entry and the data entry read from the DRAM based on the index register
for the next read request (Figure 4.11b). This module invalidates these special read/write
requests to ensure functional correctness. The parameters of this template module (e.g.,
buffer width, the value of the pre-defined address and the content of the index register) are
configured at offline compilation time. A custom tool is developed for a specific
AS ISA to automatically insert the corresponding DRAM read/write instructions into a
given software program. We also provide another custom tool for a specific AS ISA to
perform instruction reordering under the dependency constraints to maximally overlap
communication and computation.
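The behavior of the synchronization module can be illustrated with a small software model. This is a hedged sketch under several assumptions that are not stated in the text: the class name `SyncModule`, the use of queues for the inter-FPGA link, the interpretation of the index register as the set of vector lanes taken from the remote entry, and the empty dummy response to the special read are all introduced here for illustration only.

```python
from collections import deque


class SyncModule:
    """Behavioral model of the template module sitting on the DRAM interface."""

    def __init__(self, dram, sync_addr, remote_lanes, link_in, link_out):
        self.dram = dram                  # dict: address -> list of lane values
        self.sync_addr = sync_addr        # pre-defined (e.g. out-of-range) address
        self.remote_lanes = set(remote_lanes)  # assumed meaning of the index register
        self.link_in = link_in            # deque of entries arriving from the peer
        self.link_out = link_out          # deque of entries sent to the peer
        self.pending = None               # remote entry kept while the flag is set

    def write(self, addr, entry):
        if addr == self.sync_addr:
            # Special write: forwarded to the peer and not sent to DRAM.
            self.link_out.append(list(entry))
        else:
            self.dram[addr] = list(entry)

    def read(self, addr):
        if addr == self.sync_addr:
            if not self.link_in:
                return None               # no response yet: the processor stalls
            self.pending = self.link_in.popleft()  # flag is set
            return []                     # dummy response releases the barrier
        entry = list(self.dram[addr])
        if self.pending is not None:
            # Combine the remote entry with the DRAM entry for this read only.
            for lane in self.remote_lanes:
                entry[lane] = self.pending[lane]
            self.pending = None           # flag is cleared
        return entry
```

Two instances wired back to back (the `link_out` of one is the `link_in` of the other) model the two scaled-down accelerators: each writes its partial result vector to the pre-defined address, reads that address as a barrier, and then reads its own partial result to obtain the combined vector.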

4.4 Results

The developed compilation framework is evaluated using the widely used Rosetta benchmark suite [164][137]. These benchmarks are highly optimized HLS-based FPGA designs
from machine learning and image/video processing domains. To better account for the
varying performance and cost demands in the dynamic cloud environment, we create a
template architecture to scale up one given kernel so that it can process multiple tasks in
parallel, as illustrated in Figure 4.12. This organization is used to evaluate the quality of
the partition tools developed for both the two-level and the single-level system abstraction. Specifically, an application with this organization can be partitioned into multiple parts that have
a low bandwidth requirement for the cross-boundary communication (Figure 4.12). Thus,
this synthetic organization can be used to show whether the developed partition tool can
find this boundary and generate high-quality partition results when the input applications
contain such boundaries. Using this template architecture, three variants of accelerator
designs (small, medium and large) are created for each benchmark. The characteristics of
these accelerator designs are reported in Table 4.1.

[Figure 4.12 graphic omitted. It shows the top-level ports feeding a multi-level AXI interconnection built from AXI Interconnect IPs (from Vivado), followed by pipeline registers and the kernel instances (parameter NUM_KERNEL). A partition boundary with low interconnection bandwidth is marked.]

Figure 4.12: A conceptual diagram illustrates the template architecture used for generating
different variants of accelerator designs. A multi-level distribution network and pipeline
registers are included for better timing.

Table 4.1: The resource usages of evaluated benchmarks.

                              Resource Usage
Benchmark           Size      LUTs      DFFs      DSPs    BRAMs
Rendering           Small     43.5k     46.1k     48      10.3Mb
                    Medium    130.5k    137.3k    144     30.8Mb
                    Large     195.0k    179.5k    192     41.1Mb
Digit Recognition   Small     32.1k     42.5k     1       6.9Mb
                    Medium    88.6k     114.9k    3       20.6Mb
                    Large     165.0k    175.2k    5       34.3Mb
Spam Filtering      Small     41.7k     87.3k     896     9.4Mb
                    Medium    112.4k    151.9k    1792    18.8Mb
                    Large     221.3k    296.7k    3584    37.7Mb
Optical Flow        Small     94.7k     87.5k     372     7.3Mb
                    Medium    157.3k    144.4k    620     12.2Mb
                    Large     282.6k    258.1k    1116    22.0Mb
BNN                 Small     35.9k     61.6k     16      4.0Mb
                    Medium    71.1k     121.9k    32      8.0Mb
                    Large     140.3k    240.8k    64      16.0Mb
Face Detection      Small     92.5k     87.4k     152     5.8Mb
                    Medium    139.0k    130.8k    228     8.6Mb
                    Large     230.9k    217.7k    380     16.0Mb

4.4.1 Compilation Time

[Figure 4.13 graphic omitted. Horizontal bar chart of the compilation runtime in seconds (axis from 0 to 6000) for the small, medium and large variants of 3D Rendering, Digit Recognition, Spam Filter, Optical Flow, Face Detection and BNN. Legend: baseline; two-level abstraction with one type of LL virtual block; single-level abstraction; two-level abstraction with two types of LL virtual block.]

Figure 4.13: The runtime breakdown of different compilation processes for the evaluated
accelerator designs. For each accelerator design (small, medium or large), from top to
bottom, the runtime of the baseline compilation flow, the compilation flow for the two-level
abstraction that has one type of LL virtual block for one FPGA, the compilation flow for
the single-level abstraction, and the compilation flow for the two-level abstraction that has
two types of LL virtual blocks for one FPGA are drawn.

The runtime breakdown of the three compilation flows, i.e., the baseline conventional compilation flow, the compilation flow for the two-level system abstraction, and the compilation
flow for the single-level system abstraction, is drawn in Figure 4.13. In this runtime measurement, compilation tasks such as local placement are performed in parallel to exploit the
parallelism provided by the compilation framework. Overall, the runtime of the compilation flow for the
two-level system abstraction that has one type of LL virtual block for one FPGA is 31.4%
lower than that of the conventional compilation flow on average. Nevertheless, we also note
that the compilation flow for the two-level abstraction has a relatively longer runtime
than the baseline flow for the small accelerator designs. This is mainly because the
runtime reduction from the local placement/routing step is not sufficient to compensate for the
runtime of the additional steps, e.g., the global place&route step. We verified that the additional steps using custom tools, i.e., high-level partition, placement splitting and relocation,
only incur a marginal runtime overhead, since the tasks performed by these steps have a
much lower time complexity than the place&route steps. Finally, the two-level system
abstraction that has two types of LL virtual blocks for one FPGA can support a smaller
physical block than the alternative two-level abstraction, thereby reducing the local routing
time. However, more physical blocks are then provided by one FPGA device (Figure 3.18),
leading to a longer global place&route time. Thus, the total compilation time for these two
variants of the two-level system abstraction is comparable.
The compilation flow for the single-level system abstraction also reduces the runtime
by 22.6% compared to the conventional compilation flow. This reduction is slightly lower
than that achieved by the compilation flow for the two-level system abstraction. This is
mainly caused by the overhead of the partition step. This step uses the simulated annealing
algorithm to place hundreds of thousands of logic primitives into virtual blocks and generates
the partition result based on the placement result. Although a packing step is included
in the partition tool to reduce the time complexity, the long runtime of this step still
outweighs the runtime reduction from the local place&route step. Nevertheless, by using an
asynchronous interface in the single-level abstraction (one compilation result can be used for
different runtime deployments), this compilation flow generates fewer compilation results
than the flow for the two-level system abstraction for each virtual block. Thus, the total
compilation cost (the runtime summation of all compilation tasks) of this compilation flow
is 1.95× lower than that for the two-level abstraction on average, as shown in Figure 4.14.
The compilation flow for the two-level system abstraction needs to generate multiple
compilation results to account for the heterogeneity caused by the multi-die package (Section 3.2.2). Therefore, the aggregated compilation time of all compilation tasks of this
compilation flow is 4.27× longer than that of the baseline compilation flow on average (Figure 4.14). This compilation overhead is acceptable since the compilation process in the
virtualized environment is a one-time offline process (the cloud environment is not likely to
be used as a development environment). Moreover, the compilation flow for the two-level
system abstraction has a higher parallelism than the conventional compilation flow; thus, it
can better utilize the abundant resources in the cloud for compilation tasks. Providing two
types of LL virtual blocks for one FPGA further increases the compilation time by another
1.9× on average. This additional compilation cost does not lead to a significant improvement
in runtime performance, as shown in Section 5.4.3. Thus, the two-level system abstraction
with one type of LL virtual block for one FPGA is preferred.

4.4.2 Compilation Quality

The compilation framework for the two-level system abstraction maximally reuses the existing commercial FPGA compilation tools (Section 4.1). Thus, the operating frequency of
the mapped FPGA accelerator is comparable to that achieved by the conventional compilation flow, as shown in Figure 4.15. The gap in the operating frequency is less than 5%.
In contrast, when mapping the accelerators onto the single-level system abstraction, the
gap in the operating frequency is about 10% even when using a large n, as shown in Figure 4.16.
This is mainly because the partition tool developed for the single-level system abstraction
only considers wire length when performing the placement (Figure 4.7). More sophisticated
factors, such as routing congestion, are not considered in this partition tool in order to reduce
its time complexity. Thus, the quality of the placement result generated by this partition tool is lower than that of the commercial placement tool, leading to a lower operating
frequency. Nevertheless, this simplified placement process leads to good scalability with
respect to the parameter n (Figure 4.16). Thus, the parameter n is set to 16 in our compilation framework when mapping applications onto the single-level system abstraction to
obtain the best mapping quality.

[Figure 4.14 graphic omitted. Horizontal bar chart of the normalized total compilation time (axis from 0 to 12) for the small, medium and large variants of 3D Rendering, Digit Recognition, Spam Filter, Optical Flow, Face Detection and BNN. Legend: baseline; compilation flow for the two-level system abstraction; compilation flow for the single-level system abstraction; compilation flow for the two-level system abstraction with two types of LL virtual block for one FPGA.]

Figure 4.14: The aggregated compilation time of different compilation flows.

[Figure 4.15 graphic omitted. Bar chart of the normalized operating frequency (%, axis from 0 to 100) for 3D Rendering, Digit Recognition, Spam Filter, Optical Flow, Face Detection and BNN.]

Figure 4.15: The operating frequency of the accelerators mapped onto the two-level system
abstraction, which is normalized to that mapped by the conventional FPGA flow. For each
benchmark, the results of three accelerator variants are provided (from left to right: large,
medium and small).
Another key quality metric is the required inter-block communication bandwidth, which
is drawn in Figure 4.17. This result confirms that the compilation framework for the two-level system abstraction can fully utilize both off-chip and on-chip interconnection networks.
Specifically, the high-level partition step can effectively find the appropriate boundary to
partition applications into multiple HL virtual blocks with a low requirement on the communication bandwidth (< 10Gb/s). This largely reduces the burden on the off-chip inter-FPGA network. Meanwhile, enabled by the synchronous interface between the LL
virtual blocks, the monolithic placement step can fully utilize the on-chip routing fabric to
ensure the mapping quality. In the single-level system abstraction, a unified asynchronous
interface is used for both off-chip and on-chip interconnections. Thus, the compilation
framework for this abstraction cannot process these two types of interconnections separately.
Consequently, all inter-block interconnections require similar communication bandwidth, as
shown in Figure 4.17. On the one hand, the generated communication interfaces cannot
fully utilize the on-chip routing fabric. On the other hand, the required communication
bandwidth is also higher than that provided by the inter-FPGA interconnection network.
Therefore, it is preferred to apply the single-level system abstraction for small applications
that do not have strict requirements on performance.

[Figure 4.16 graphic omitted. Two plots against n = 1, 2, 4, 8, 16: the normalized operating frequency (%, axis from 0 to 100, with one value annotated as 91.1%) and the average partition time (s, axis from 0 to 300).]

Figure 4.16: The average operating frequency obtained under different n values, which
is normalized to that mapped by the conventional FPGA compilation flow. The average
partition time with different n values is also reported. The parameter n is defined in
Section 4.2.
When mapping these benchmarks onto the proposed system abstractions, BRAM resources are the bottleneck. Thus, it is meaningful to report the utilization of the BRAM
resources provided by the allocated physical blocks. In the two-level and single-level system
abstractions, the amount of BRAM resources wasted due to internal fragmentation
is 9.7% ∼ 37.5%. For the two-level abstraction with two types of LL virtual blocks for
one type of FPGA, the resource waste caused by internal fragmentation is 4.3% ∼ 16.9%,
which is lower than that in the other two abstractions due to the smaller physical blocks.
It is hard to reduce the resource waste caused by internal fragmentation, since it is
hard to create identical physical blocks that can be fully utilized by diverse workloads with
distinct resource usages. One possible solution is mapping multiple LL virtual blocks into
one physical block. However, as discussed in Section 3.2.2, this mapping strategy increases
the runtime management complexity and might also lead to security concerns. The internal
fragmentation issue needs more careful exploration, which is left as future work.

[Figure 4.17 graphic omitted. Horizontal bar chart, on a logarithmic axis from 1 to 10000 Gb/s, of the required communication bandwidth for the large, medium and small variants of 3D Rendering, Digit Recognition, Spam Filter, Optical Flow, Face Detection and BNN. Legend: results for the two-level abstraction; results for the single-level abstraction.]

Figure 4.17: The required communication bandwidth of the inter-block interconnections
when mapping applications onto the two-level system abstraction and the single-level system
abstraction. Enabled by the two-level system abstraction, the corresponding compilation
framework can effectively identify the boundary with the low bandwidth requirement to
partition these benchmarks. On the contrary, due to the unified asynchronous interface
in the single-level abstraction, the corresponding compilation framework cannot find such
a boundary.

4.4.3 Case Study: AS ISA-based Accelerator

In this case study, an AS ISA similar to the one proposed in the Microsoft BrainWave project [40]
is used to evaluate the performance of the specialized compilation framework (Section 4.3). This AS ISA is chosen because it is a representative use case and has been
deployed in the commercial FPGA cloud to build a production-scale system for low-latency
DNN inference. Specifically, we develop a parameterized accelerator design for this AS ISA,
as the design of the BrainWave project is not publicly available. The organization of this
accelerator design is similar to that described in [40], e.g., tile engines and multi-function
units, as illustrated in Figure 4.18. It uses the block floating point format (BFP) for the
matrix-vector multiplication to increase the computing capability, and the half-precision floating point format (float16) for other secondary operations (e.g.,
point-wise vector multiplication and activation) to reduce quantization noise. The number of tile engines in this design
can be adjusted to generate accelerator instances with different computing capabilities to
account for the varying performance/cost demands. A parameterized memory module is
developed so that the accelerator design can leverage the URAM resources when being
mapped onto the UltraScale+ FPGA. While this solution provides a unified memory interface to simplify the design of the accelerator, it leads to an under-utilization of the URAM
resources, as a URAM provides a larger capacity (4096 72-bit words) than a BRAM does
(512 72-bit words). Further optimization of the accelerator design is left as
future work. Moreover, an instruction buffer is included in the accelerator design to
minimize memory accesses.
This accelerator can then be decomposed using the provided compilation framework.
The root soft block of the data path has child blocks connected in pipeline parallelism.
Nevertheless, the FP16-to-BFP converter and the vector register file are much smaller than
the remaining components. Thus, we mark these two components as control logic and
group them into the soft block that contains the control logic (Figure 4.18). With this
modification, the soft block of the data path has data parallelism and the optimization
described in Section 4.3.2 can be applied to this accelerator design.

[Figure 4.18 graphic omitted. The accelerator (instruction decoder/scheduler, vector register file, FP16-to-BFP converter, tile engines and MFUs) is decomposed into a soft block with the control logic and a soft block with the data path, the latter containing DPEs, BFP-to-FP16 converters and MFUs.]

Figure 4.18: A conceptual diagram illustrates the organization of the AS ISA-based accelerator design and the decomposing results.

Table 4.2: Hardware implementation results of the two baseline accelerators.

Device   #MVM Tiles   LUTs   DFFs   BRAMs    URAMs    DSPs   Freq. (MHz)   Peak TFLOPS
VU37P    21           610k   659k   51.5Mb   22.5Mb   7517   400           36
KU115    13           367k   386k   45.4Mb   -        5073   300           16.7

In order to provide a high-quality baseline and ensure a fair comparison, the floorplanning function provided by Vivado is applied to improve the implementation quality
by manually optimizing the placement, as shown in Figure 4.19a. The resource usage and
the performance of the baseline accelerator on the two types of FPGAs are reported in
Table 4.2. By leveraging floorplanning, the peak throughput of the baseline accelerator is
comparable to that reported in [40]. This floorplanning is also reused when placing virtual
blocks (LL virtual block in the two-level abstraction and virtual block in the single-level
abstraction) into physical blocks for a fair comparison (Figure 4.19b).

[Figure 4.19 graphic omitted: panels (a) and (b).]

Figure 4.19: The floorplanning is leveraged to improve the mapping quality of the baseline
accelerator. Part of the floorplanning used for the XCVU37P FPGA is shown in (a). This
function is also leveraged to improve the mapping quality of one virtual block to ensure a
fair comparison. The optimized implementation result is shown in (b).

DeepBench [97], which contains representative layers from various DNN models, is used
to evaluate the compilation quality. Specifically, this benchmark suite provides several
GRU/LSTM inference tasks and the latency of these tasks with a batch size of one is
measured. Two different scenarios are considered when evaluating the inference latency: 1)
the AS ISA-based accelerator used to process one inference task is deployed onto a single
FPGA device (no inter-FPGA communication overhead), and 2) the accelerator is deployed
onto two FPGA devices. The inference latency in the first scenario is reported in Table 4.3.
We observe that there is only a marginal increase in the inference latency (3% ∼ 8%),
which is mainly caused by the latency-insensitive interface between HL virtual blocks. This
negligible overhead is achieved by leveraging the extracted parallel patterns. Specifically,
the partition tool described in Section 4.3.2 can avoid placing the pipelined data path within
a SIMD unit across HL virtual blocks by leveraging the parallel patterns. Consequently, it
effectively minimizes the additional latency introduced by the latency-insensitive interface.

Table 4.3: The latency of LSTM/GRU inference tasks.

                                  Latency (ms)
Benchmark             Device   Baseline   This dissertation   Overhead
GRU h=512 t=1         VU37P    0.0131     0.0136              3.8%
                      KU115    0.0227     0.0236              3.9%
GRU h=1024 t=1500     VU37P    5.01       5.4                 7.8%
                      KU115    18.5       19.9                7.8%
GRU h=1536 t=375      VU37P    1.83       1.96                7.5%
                      KU115    6.91       7.43                7.5%
LSTM h=256 t=150      VU37P    0.726      0.767               5.7%
                      KU115    1.31       1.38                5.6%
LSTM h=512 t=25       VU37P    0.129      0.136               5.3%
                      KU115    0.232      0.245               5.3%
LSTM h=1024 t=25      VU37P    0.146      0.157               7.0%
                      KU115    0.263      0.282               7.1%
LSTM h=1536 t=50      VU37P    0.238      0.258               8.4%
                      KU115    -          -                   -

-: Cannot fit into the FPGA.

[Figure 4.20 graphic omitted. Two plots of inference latency (ms) against the additional latency (µs, from 0 to 1.0): a GRU panel with curves for h=1024 t=1500 and h=2560 t=375, and an LSTM panel with curves for h=1024 t=25 and h=2048 t=25.]

Figure 4.20: The impact of the inter-FPGA communication latency on the inference latency
when the AS ISA-based accelerator is deployed onto two FPGA devices.

We then evaluate the impact of the inter-FPGA communication latency when one AS
ISA-based accelerator is deployed onto two FPGA devices. We implement a programmable
module that includes a counter and a FIFO on FPGAs to intentionally add a certain amount
of latency into the inter-FPGA communication. This allows us to comprehensively evaluate
the effectiveness of the proposed optimization technique (Section 4.3.2) under various conditions. As shown in Figure 4.20, the proposed technique can effectively hide the inter-FPGA
communication latency for LSTM inference tasks by overlapping the data transfer of vector
h_t and the matrix multiplication related to x_t. For GRU inference tasks, this technique can
overlap the data transfer and computation for a small GRU model (h = 1024) when the added
communication latency is less than 0.6µs. Nevertheless, the inter-FPGA communication
latency cannot be hidden for a large GRU model (h = 2560). This is because a large GRU
model needs a large AS ISA-based accelerator that provides sufficient on-chip storage for
weights. Such a large accelerator also provides more computation capability than the one
used for a small model, leading to a shorter computation time. On the other hand, the data
transfer time increases in a larger model as it has a longer vector. Therefore, compared
with small GRU models, it is harder to hide the inter-FPGA communication latency for
large GRU models.
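The overlap argument above can be summarized with a simple timing model. This is an illustrative sketch rather than a formula taken from the evaluation: the symbols $T_{\text{comp}}$ (computation that does not depend on the transferred vector), $T_{\text{xfer}}$ (inter-FPGA transfer time of the vector) and $T_{\text{add}}$ (intentionally added latency) are introduced here, and the model assumes the transfer and the independent computation run fully in parallel.

```latex
T_{\text{step}} = \max\left(T_{\text{comp}},\; T_{\text{xfer}} + T_{\text{add}}\right),
\qquad
\text{latency hidden} \iff T_{\text{xfer}} + T_{\text{add}} \le T_{\text{comp}}
```

Under this model, a larger accelerator shrinks $T_{\text{comp}}$ while a longer vector grows $T_{\text{xfer}}$, so the inequality is violated at a smaller $T_{\text{add}}$ for a large GRU model than for a small one.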
Overlapping the inter-FPGA communication latency and computation means that one
application can be deployed onto multiple physical FPGA devices without affecting the
inference latency. Thus, this optimization technique allows us to treat more applications
as batch workloads (not sensitive to inter-FPGA communication latency) during runtime
deployment, which improves the runtime performance as shown in Figure 5.6b.


Chapter 5
Scheduling and Resource Management
This chapter presents a preliminary exploration of the runtime system for cloud FPGAs.
Specifically, enabled by the proposed two-level system abstraction (Section 3.2), a modular
runtime management system is designed for the heterogeneous FPGA cluster. This modular
design provides extendability to support clusters with different types of FPGAs. This
chapter then describes a heuristic-based policy for resource allocation to better utilize the
flexibility provided by the proposed system abstraction and avoid performance degradation
caused by the resource fragmentation and the long inter-FPGA communication latency. A
scheduling policy is also developed that considers the distinct characteristics of different
cloud instances (Section 2.5) to improve the overall system performance.

5.1 Modular Runtime System

Due to the hardware rolling upgrade strategy, the types of FPGAs contained in one heterogeneous FPGA cluster keep changing. Thus, it is necessary to design an extendable runtime
system to support such a scenario. Enabled by the proposed two-level system abstraction,
a two-level modular runtime management system is designed, as illustrated in Figure 5.1a.
Overall, this runtime management system comprises a top-level manager for task scheduling
and multiple bottom-level managers (one for each type of FPGA) that perform the resource
allocation and low-level tasks such as loading bitstreams onto FPGAs. A new bottom-level
manager can be added into this runtime system when a new type of FPGA is deployed, which
provides good extendability.

[Figure 5.1 graphic omitted. Panel (a): a hypervisor exposing APIs, with a top-level manager (two task queues, one for on-demand instances and one for spot instances, and a compilation database holding HL virtual blocks only) above the bottom-level managers (each with a resource allocator, a resource database, a compilation database holding LL virtual blocks only, and a controller for its FPGA type) that manage FPGA type 1 and FPGA type 2 in the heterogeneous FPGA cluster. Panel (b): a hypervisor exposing APIs with a single manager (two task queues, a resource allocator, a resource database and a compilation database holding virtual blocks) for the homogeneous FPGA cluster.]

Figure 5.1: (a) The two-level modular runtime management system for the heterogeneous
FPGA cluster. The on-demand and spot instances are defined in Section 2.5. (b) The
single-level runtime management system for the homogeneous FPGA cluster.


As illustrated in Figure 5.1a, the top-level manager maintains two task queues to schedule on-demand and spot instances separately using the scheduling policy described in Section 5.2. To allocate FPGA resources for one scheduled instance, this top-level manager
first obtains the HL virtual blocks generated for the corresponding application from the
database that stores the compilation results. It then sends these HL virtual blocks to all
bottom-level managers. After receiving one HL virtual block, the bottom-level manager
first obtains the corresponding LL virtual block arrays and the resource availability of the
specific type of FPGAs from the database (Figure 5.3). It then uses the heuristic-based
policy (Section 5.3) to allocate LL virtual block arrays for deploying the given HL virtual
block. The bottom-level manager finally returns the highest heuristic score for deploying
one HL virtual block into the corresponding type of FPGAs to the top-level manager. After
collecting the heuristic scores from all bottom-level managers, the top-level manager uses
a heuristic method to generate the optimal resource allocation and then sends requests to
the corresponding bottom-level managers to deploy the scheduled instance.
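The interaction between the two manager levels can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the runtime system itself: bottom-level managers are modelled as plain callables that return their best heuristic score for one HL virtual block, an application is given as a list of alternative HL virtual block sets, and the rule for combining scores (geometric mean, scaled by a hypothetical `penalty` factor when several HL virtual blocks are used) is an assumption suggested by the labels in Figure 5.3. Resource conflicts between HL virtual blocks placed by the same manager are ignored.

```python
from math import prod
from typing import Callable, Dict, List, Optional, Tuple

# A bottom-level manager returns its highest heuristic score for one HL
# virtual block, or None if its FPGA type cannot host that block.
BottomManager = Callable[[str], Optional[float]]


def best_placement(hl_block: str,
                   managers: Dict[str, BottomManager]) -> Optional[Tuple[str, float]]:
    """Ask every bottom-level manager and keep the highest score."""
    scores = {name: query(hl_block) for name, query in managers.items()}
    scores = {name: s for name, s in scores.items() if s is not None}
    if not scores:
        return None
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


def allocate(options: List[List[str]],
             managers: Dict[str, BottomManager],
             penalty: float = 1.0):
    """Pick the alternative (set of HL virtual blocks) with the best score."""
    best = None
    for blocks in options:
        placed = [best_placement(b, managers) for b in blocks]
        if any(p is None for p in placed):
            continue  # this alternative cannot be deployed right now
        score = prod(s for _, s in placed) ** (1 / len(placed))
        if len(blocks) > 1:
            score *= penalty  # assumed cost of inter-FPGA communication
        if best is None or score > best[0]:
            best = (score, {b: name for b, (name, _) in zip(blocks, placed)})
    return best
```

With one manager that scores block `#1` at 0.6 and blocks `#2_1` and `#2_2` at 0.7 and 0.5, the single-block alternative wins because the geometric mean of 0.7 and 0.5 is about 0.59.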

5.1.1 Specialized for A Homogeneous FPGA Cluster

Similar to the process of merging the two-level system abstraction into a single-level one
for a homogeneous FPGA cluster, the two-level runtime management system is merged
into a single-level one to manage the homogeneous FPGA cluster, as illustrated in Figure 5.1b. Specifically, the runtime manager for the homogeneous FPGA cluster performs both
task scheduling and resource allocation using the policies described in Sections 5.2 and 5.3,
respectively.

5.2 Task Scheduling Policy

As illustrated in Figure 5.1a, the top-level manager maintains two task queues to schedule on-demand and spot instances separately. Specifically, on-demand instances are scheduled in a first-come first-served (FCFS) manner to guarantee their performance, while spot instances are scheduled whenever the FPGA cluster has sufficient resources, which improves the aggregate system performance by exploiting the opportunity of task backfilling [39]. When the FPGA cluster does not have sufficient resources for a newly arrived on-demand instance, deployed spot instances are interrupted and evacuated from the cluster one by one based on the deployment sequence until the cluster has sufficient resources. The evacuated spot instances are placed at the end of the corresponding task queue; thus, spot instances are backfilled in a round-robin manner to ensure fairness. It is possible that the newly arrived on-demand instance still cannot be deployed after all running spot instances have been evacuated. In that case, the system controller tries to deploy it again when a running on-demand instance terminates. This scheduling policy is effective in the cloud environment, which has insufficient runtime information to support a more sophisticated policy. For instance, it is impossible to obtain or estimate the completion time of instances, as instances can be terminated by users at any time under the pay-as-you-go pricing mechanism.

Figure 5.2: (a) A conceptual diagram illustrating that an inappropriate resource allocation leads to resource fragmentation. (b) A conceptual diagram illustrating the calculation of the fragmentation score, Fragmentation Score = Geomean(#Contiguous Physical Blocks / FPGA Capacity). The service region in FPGAs is not drawn for simplicity.
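
The scheduling policy in this section can be summarized in a short sketch. It is a minimal illustration under simplifying assumptions, not the runtime manager itself: the cluster is modeled as a single pool of interchangeable physical blocks (contiguity and FPGA boundaries are ignored), "deployment sequence" is taken to mean oldest deployment first, and the names `Scheduler`, `submit` and `terminate` are hypothetical.

```python
from collections import deque


class Scheduler:
    """Two-queue policy: FCFS for on-demand instances, backfilling for spot instances."""

    def __init__(self, capacity):
        self.free = capacity           # free physical blocks (assumed single pool)
        self.on_demand = deque()       # waiting on-demand instances, FCFS
        self.spot = deque()            # waiting spot instances, round-robin
        self.running_on_demand = []    # deployed (name, blocks) pairs
        self.running_spot = []         # deployed spot instances, oldest first

    def submit(self, name, blocks, on_demand):
        queue = self.on_demand if on_demand else self.spot
        queue.append((name, blocks))
        self._schedule()

    def terminate(self, name):
        for running in (self.running_on_demand, self.running_spot):
            for inst in running:
                if inst[0] == name:
                    running.remove(inst)
                    self.free += inst[1]
                    self._schedule()   # freed blocks may unblock waiting instances
                    return

    def _schedule(self):
        # On-demand instances: strictly FCFS, evicting spot instances if needed.
        while self.on_demand:
            name, blocks = self.on_demand[0]
            while self.free < blocks and self.running_spot:
                victim = self.running_spot.pop(0)   # assumed: oldest deployment first
                self.free += victim[1]
                self.spot.append(victim)            # re-queue at the tail
            if self.free < blocks:
                break                               # retried on a later scheduling pass
            self.on_demand.popleft()
            self.free -= blocks
            self.running_on_demand.append((name, blocks))
        # Spot instances: backfill every waiting instance that currently fits.
        for _ in range(len(self.spot)):
            inst = self.spot.popleft()
            if inst[1] <= self.free:
                self.free -= inst[1]
                self.running_spot.append(inst)
            else:
                self.spot.append(inst)              # keeps the original queue order
```

Because an evicted spot instance re-enters at the tail of its queue, repeated evictions rotate through the spot instances, which is the round-robin behavior described above.
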

5.3

Resource Allocation Policy

A heuristic-based resource allocation policy is provided to minimize the resource waste caused by fragmentation and the performance degradation due to inter-FPGA communication. Specifically, as an array of LL virtual blocks for one HL virtual block needs to be deployed into contiguous physical blocks (Section 3.2.2), arbitrarily allocating physical blocks is likely to cause resource fragmentation, as illustrated in Figure 5.2a. To address this issue, a fragmentation score is calculated for every possible resource allocation. As illustrated in Figure 5.2b, this fragmentation score is calculated as the geometric mean of the ratio between the number of contiguous physical blocks after one specific resource allocation and the total number of physical blocks provided by one FPGA. A higher fragmentation score means this FPGA can provide more contiguous physical blocks (thus less resource fragmentation) after this resource allocation.

Figure 5.3: A conceptual diagram illustrating the flow of allocating resources for one application. Only one bottom-level manager is drawn for simplicity.
[Example in the figure: deploying the application as a single HL virtual block #1 scores 0.6, while splitting it into HL virtual blocks #2_1 and #2_2, with scores 0.7 and 0.5, scores 0.59 × p.]
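
The fragmentation score can be illustrated with a small sketch. The text does not fully specify what the geometric mean ranges over, so the following assumes that each FPGA is a one-dimensional array of physical blocks, that "contiguous physical blocks" means the longest run of free blocks, and that the geometric mean is taken over the FPGAs considered. The function names are hypothetical.

```python
import math


def longest_free_run(occupied):
    """Longest run of free blocks in a 1-D occupancy list (True = occupied)."""
    best = run = 0
    for used in occupied:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def fragmentation_score(fpgas):
    """Geometric mean over FPGAs of (contiguous free blocks / FPGA capacity)."""
    ratios = [longest_free_run(f) / len(f) for f in fpgas]
    return math.prod(ratios) ** (1 / len(ratios))


# Two ways to place a 2-block array on an empty 8-block FPGA.
at_edge = [True, True] + [False] * 6
in_middle = [False] * 3 + [True, True] + [False] * 3
print(fragmentation_score([at_edge]))    # 0.75
print(fragmentation_score([in_middle]))  # 0.375
```

Under these assumptions, placing the array at the edge leaves six contiguous free blocks and scores higher than placing it in the middle, which splits the free space into two runs of three.
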
The bottom-level manager calculates the fragmentation score for all possible resource allocations of one HL virtual block. The highest fragmentation score for this HL virtual block
is returned to the top-level manager. As one application might be partitioned into multiple
HL virtual blocks, the top-level manager generates all possible combinations for deploying
one specific application, as illustrated in Figure 5.3. A fragmentation score is calculated
for each combination, which is the geometric mean of the score of the HL virtual blocks in
one combination. For the combination that deploys one application into multiple FPGAs,
a multi-FPGA penalty p ≤ 1 is applied to avoid severe performance degradation caused by the inter-FPGA communication. The basic principle for determining the value of p is that an application whose performance is more sensitive to the inter-FPGA communication latency should have a smaller p. For instance, an application that performs stream processing will have a smaller p than one that performs batch processing; at the extremes, a batch-processing application can set p = 1, while a stream-processing application can set p = 0. This coefficient can also be controlled by users to account for varying demands on performance and cost. For example, users that do not have strict requirements on performance can set p to 1 to minimize the cost. Finally, the combination with the highest fragmentation score is used to deploy the application into the FPGA cluster (Figure 5.3).
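
The combination scoring can be sketched as follows, using the per-block scores shown in Figure 5.3 (0.6 for the single-block option, 0.7 and 0.5 for the two-block option). This is an illustrative sketch: the FPGA identifiers and the function names are hypothetical, and it assumes that the penalty p simply multiplies the geometric mean whenever the HL virtual blocks land on more than one FPGA.

```python
import math


def geomean(values):
    return math.prod(values) ** (1 / len(values))


def combination_score(blocks, p):
    """blocks: one (fpga_id, best_score) pair per HL virtual block."""
    score = geomean([s for _, s in blocks])
    spans_multiple_fpgas = len({fpga for fpga, _ in blocks}) > 1
    return score * p if spans_multiple_fpgas else score


def best_combination(combinations, p):
    """Pick the combination with the highest (penalized) score."""
    return max(combinations, key=lambda blocks: combination_score(blocks, p))


single = [("fpga0", 0.6)]                  # option #1 in Figure 5.3
split = [("fpga0", 0.7), ("fpga1", 0.5)]   # option #2_1 + #2_2 (FPGA ids assumed)
print(round(combination_score(split, p=1.0), 2))  # 0.59
```

With these numbers the single-FPGA option wins for any p ≤ 1, since 0.6 exceeds 0.59 × p; a split option is only selected when its unpenalized score is high enough to survive the penalty.
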

5.3.1

Possible Extension

This heuristic-based method can be easily extended to take more factors into consideration.
For example, if the required DRAM bandwidth of one application is available, then a
new DRAM contention score can be calculated to minimize the resource contention when
deploying multiple applications onto the same FPGA device. One possible way to calculate the DRAM contention score is $\text{Total DRAM Bandwidth} / \sum \text{Required Bandwidth}$, where a higher score is better. Other factors such as power consumption (contention on the power
distribution network) can also be included. These possible extensions require additional
profiling tools to obtain the applications’ characteristics, which are not included in this
dissertation, as they are not indispensable building blocks for a virtualization framework.
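
As a brief illustration of the DRAM contention score suggested above, the sketch below divides the total DRAM bandwidth of one FPGA by the summed bandwidth demand of the applications placed on it. The function name and the bandwidth figures (in GB/s) are hypothetical and only serve to show the direction of the metric.

```python
def dram_contention_score(total_bandwidth, required_bandwidths):
    """Total DRAM bandwidth divided by the summed demand; higher means less contention."""
    return total_bandwidth / sum(required_bandwidths)


# Hypothetical numbers in GB/s.
print(dram_contention_score(16, [4, 4]))    # 2.0, demand fits with headroom
print(dram_contention_score(16, [8, 12]))   # 0.8, demand exceeds the supply
```
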

5.4

Results

As there are no publicly available real-world cloud workloads using FPGAs, we follow the widely used approach [101] to synthetically generate several workload sets to evaluate the scheduling and resource allocation policies. Each workload set contains a sequence of accelerator designs from the benchmark set used for evaluating the compilation framework (Table 4.1). The requests for deploying these workloads are issued at random time intervals to emulate the dynamic cloud environment. A resource contention ratio is calculated to characterize these workload sets. This ratio is defined as the amount of FPGA resources required for every workload to be deployed immediately without waiting, divided by the amount of FPGA resources provided by the cluster. When this ratio is smaller than or equal to one, no contention happens and every workload can be deployed without waiting. The ratio between the number of on-demand and spot instances and the ratio between the number of batch and streaming workloads in one workload set can also be adjusted to provide a comprehensive evaluation.
For one workload set, all workloads are compiled onto the proposed system abstraction using the provided compilation framework to obtain the key performance metrics, such as throughput, latency and resource usage. This information is the input of a software emulator that contains a controller and an FPGA status maintainer. Specifically, the controller implements the described scheduling policy and resource allocation policy, while the FPGA status maintainer uses the results generated by the compilation framework to track the FPGA status. This emulator can generate a trace that contains the resource allocation and deallocation operations, which is validated on the custom-built FPGA cluster (Section 3.5).
This emulator outputs the normalized response time as the performance metric, which is calculated as

\[ \text{normalized response time} = \frac{\text{wait time} + \text{execution time}}{\text{execution time}} \]

5.4.1

Design Space Exploration on Parameter N and K

We perform a design space exploration to evaluate the impact of the parameters N and K. As shown in Figure 5.4-top, with a fixed parameter K, increasing parameter N from 0 (applications can only be mapped into a single FPGA) to 1 (applications can be mapped into up to two FPGAs) effectively reduces the normalized response time of the on-demand instances by 20.4% under high resource contention, while further increasing it to 2 (applications can be mapped into up to four FPGAs) only leads to a marginal reduction (< 1%). A similar trend is also identified for the spot instances. This is because N = 0 cannot enable FPGA sharing across physical FPGA boundaries, degrading the runtime performance due to the lack of multi-FPGA support. Since N = 1 already provides multi-FPGA support that effectively reduces external resource fragmentation, a larger N only leads to a marginal improvement. We also confirm that the parameter K has a marginal impact on the normalized response time (< 4% as shown in Figure 5.4-bottom), which is consistent with our analysis in Section 4.1.1. Thus, we use parameters N = 1 and K = 1 in all evaluations.

Figure 5.4: The normalized response time for on-demand and spot instances under different N and K. The percentages of on-demand instances and batch workloads are both 50%.
[Plots: x-axis, resource contention ratio (0.15 to 1.10); y-axis, normalized response time; curves for N = 0, 1, 2 with K = 2 and for K = 1, 2, 3 with N = 1.]

Figure 5.5: (a) The comparison of the normalized response time over the non-virtualized environment for the heterogeneous FPGA cluster. (b) The comparison of the normalized response time delivered by the two-level system abstraction and the single-level system abstraction for the homogeneous FPGA cluster. The results of the non-virtualized environment are not drawn in (b) for better clarity.
[Plots: x-axis, resource contention ratio; y-axis, normalized response time; separate panels for on-demand and spot instances.]

5.4.2

Improvement Over Non-virtualized Environment

The runtime performance of the virtualized environment is compared with that of the non-virtualized environment (i.e., allocating an entire FPGA to one application). Overall, the two-level system abstraction effectively reduces the normalized response time compared to the non-virtualized case under high resource contention (Figure 5.5a). Since the wait time accumulates when deploying a sequence of applications, it is meaningful to report the highest resource contention ratio that can be supported while the normalized response time stays below a given threshold. Specifically, if the normalized response time is required to be lower than 1.005 (to avoid a rapid accumulation of wait time), the two-level system abstraction can support a 1.62× higher resource contention ratio compared to the non-virtualized case when using the same FPGA cluster. This improvement comes from the fine-grained FPGA sharing enabled by the two-level system abstraction. The heuristic-based resource allocation policy also effectively reduces the resource waste caused by external fragmentation. Specifically, the utilization of the physical blocks is about 94% on average. Figure 5.5b shows the performance comparison between the two-level system abstraction and the single-level system abstraction on a homogeneous FPGA cluster. Specifically, if the normalized response time is required to be lower than 1.005, the two-level system abstraction can support a 1.19× higher resource contention ratio compared to the single-level abstraction when using the same FPGA cluster. This is because the additional communication region in the single-level system abstraction reduces the amount of FPGA resources provided by the FPGA devices. We also confirm that the runtime policy can support the dynamic cloud environment and provide stable performance under different compositions of cloud instances and workloads (Figures 5.6a and 5.6b).

5.4.3

Comparison between Variants of Two-Level System Abstraction

The performance comparison between the two variants of the two-level system abstraction is shown in Figure 5.7. By managing the FPGA resource in a more fine-grained manner, the two-level system abstraction that provides two types of LL virtual blocks for one FPGA effectively reduces the normalized response time by up to 20.5% compared with the one with one type of LL virtual block. Nevertheless, the reduction for the on-demand instances is marginal (< 3%). This is mainly because the runtime scheduling policy already effectively reduces the normalized response time for the performance-driven on-demand instances. Consequently, the improvement from the abstraction is limited. Since the two-level system abstraction that provides two types of LL virtual blocks for one FPGA has a higher compilation cost (Section 4.4.1), it is preferable to use the simpler one that only provides one type of LL virtual block for one FPGA.

Figure 5.6: The normalized response time under different percentages of (a) on-demand instances and (b) batch workloads. The resource contention ratio is 0.9 in both experiments. The percentage of batch workloads is 50% in (a), and the percentage of on-demand instances is 50% in (b).

Figure 5.7: The performance comparison between the two different variants of the two-level system abstraction.

Chapter 6
Extend to Liquid Silicon
This chapter presents a new reconfigurable architecture, namely Liquid Silicon, that is used
as a case study to show that the proposed virtualization solution can be extended to other
spatial reconfigurable architectures. Enabled by the non-volatile memory technology (i.e.,
RRAM), Liquid Silicon has a homogeneous architecture comprising a two-dimensional (2D)
array of identical “tiles”. Different from the heterogeneous FPGA architecture that uses
specialized hard IP blocks for dedicated functions (Figure 2.1), each tile in Liquid Silicon
can be configured into one or a combination of four modes: heavy-weight compute mode,
light-weight compute mode, interconnect mode, and memory mode. Such flexibility allows
users to partition resources based on applications’ needs, in contrast to the fixed resource
provisioning in FPGAs that is determined by the vendors during manufacturing. The
following sections first present the necessary background information and then describe the
architecture and the custom compilation framework developed for Liquid Silicon. A chip
demonstration of Liquid Silicon is also provided. Finally, the method of extending the
proposed virtualization solution to Liquid Silicon is presented.

6.1

Background

6.1.1

RRAM and Access Device

Resistive random access memory (RRAM) is one promising non-volatile memory technology because of its small cell size (4F²), fast switching time (as low as 10 ns [122]), excellent scalability (< 10 nm [134]), and good endurance (up to 10¹² cycles [79]). The TaOx RRAM device from Panasonic [132], which has been used in commercial products since 2013 [103], is used in this dissertation. The structure and the resistive switching I-V curve of this TaOx RRAM cell are drawn in Figure 6.1. This RRAM device is used to build the crossbar array in Liquid Silicon. Benefiting from the CMOS-compatible monolithic 3D fabrication process, the RRAM crossbar array can be stacked atop CMOS circuits in the back end of line (BEOL), thereby not consuming die area, as illustrated in Figure 6.4.

Figure 6.1: (a) The Ir/Ta2O5−δ/TaOx/TaN structure [132] of one RRAM cell. (b) The resistive switching I-V curve.
[Panel (b) axes: voltage (V) from −1 to 1 and current (A) on a logarithmic scale; the curve is annotated with an abrupt SET and a gradual RESET.]
In the crossbar array, an access device is paired with each RRAM device to suppress the leakage current on the sneak paths. This access device effectively reduces the power consumption when writing the RRAM array [20]. It also eliminates the sneak path leakage current during computation (Section 6.2.4-Crossbar Array). Among various access devices [113][128], we choose the FAST selector [62] to build the crossbar array in Liquid Silicon. It is a two-terminal bi-directional diode with a high selectivity (∼10¹⁰), a steep turn-on slope (< 5 mV/dec), a BEOL-compatible fabrication process, and an adjustable turn-on voltage.

6.1.2

Related Work

Several works proposed to use nanowire-based crossbar arrays to build nanoscale reconfigurable computing architectures [34][36][46][109]. In these architectures, a group of logic gates can be implemented by nanowire-based crossbar arrays. Due to the small feature size of the nanowire, they consume less area and achieve higher performance than CMOS-based implementations. Liquid Silicon differs from them in two aspects. First, besides implementing logic functions, crossbar arrays in Liquid Silicon can also implement other functions, e.g., memory and ternary content-addressable memory (TCAM), thereby improving the hardware utilization when supporting diverse workloads. Second, these nanowire-based architectures require the logic functions mapped onto one crossbar to have the same data-flow direction, i.e., all inputs are applied on the word-lines and all outputs appear on the bit-lines, or vice versa. In contrast, Liquid Silicon allows more fine-grained control, i.e., inputs and outputs can have different data-flow directions in one crossbar. This flexibility is utilized by the custom compilation framework to improve the mapping quality.
Numerous research efforts have been devoted to investigating novel FPGA architectures based on non-volatile memory technologies [22][25][42][53][84]. In these architectures, non-volatile memory cells (e.g., RRAMs) are used to 1) replace the SRAM cells in LUTs, 2) replace the pass gates in the routing fabric (connection blocks and switch blocks) as programmable switches, or 3) build dense on-chip memory blocks. Benefiting from the BEOL-compatible fabrication process and non-volatility, these implementations reduce the chip area and power consumption without changing the basic architecture of FPGAs. Nevertheless, these architectures use non-volatile memory cells as a direct drop-in replacement for the SRAM cell. In contrast, Liquid Silicon provides a radically different reconfigurable architecture that is tailored to the RRAM technology, which allows flexible resource partitioning among computation, storage and routing.
In another interesting work, a configurable memory array is built upon a crossbar array using conventional SRAM [60], which can also be configured to perform TCAM/CAM functions and bit-wise logic operations. It stores words column-wise in TCAM/CAM mode, but row-wise in logic mode. Consequently, it requires data reshuffling when switching between operations, thereby reducing flexibility and efficiency. In contrast, Liquid Silicon does not have any of these restrictions, and words (data entries) can be stored in either direction. Additionally, that work only implements a single SRAM block to realize simple bit-wise logic functions (e.g., AND and NOR), while Liquid Silicon can implement arbitrarily complex logic with a full-fledged compilation tool to support the application mapping.

Figure 6.2: Liquid Silicon provides user-controlled resource provisioning to cover the whole spectrum, from data-intensive to compute-intensive applications. In contrast, FPGAs only provide efficient support for compute-intensive applications.
[Figure annotations: tiles are configured in memory, light-weight compute, or heavy-weight compute/interconnect mode to span data-intensive, search-intensive and compute-intensive workloads (low to high compute-to-memory access ratio); FPGAs provide a fixed resource partition between computation (configurable logic blocks) and storage (block RAM) and limited on-chip storage. Routing blocks in FPGAs are not drawn.]

6.2

Liquid Silicon Architecture

6.2.1

Overview

Liquid Silicon is a homogeneous architecture that comprises a 2D array of identical building blocks (also referred to as “tiles”), as illustrated in Figure 6.4. Different from the island-style FPGA architecture that contains specialized hard IP blocks for dedicated functions, each tile in Liquid Silicon can be configured into one or a combination of four distinct modes: 1) light-weight compute mode, 2) heavy-weight compute mode, 3) interconnect mode, and 4) memory mode, depending on the workloads (Figure 6.2). In addition to the coarse-grained (tile-wise) configuration that allows Liquid Silicon to provide an adjustable compute-to-memory access ratio (defined in [159]), Liquid Silicon also allows a flexible resource partitioning within a tile between heavy-weight compute and interconnect in a more fine-grained manner based on the actual usage (Figure 6.3a). Such a combination of coarse-grained and fine-grained controls leads to improved utilization and reduced routing pressure compared with conventional FPGAs (Figure 6.3b).

Figure 6.3: (a) To improve resource utilization, one tile can be partitioned between heavy-weight compute mode and interconnect mode. (b) This flexibility results in better mapping with lower routing pressure compared to FPGAs.
Despite the rich configuration modes supported by each tile, the tile has a relatively
simple structure with two basic building blocks: a crossbar array and a set of connection
nodes, as illustrated in Figure 6.4.
The crossbar array contains multiple word-lines (WLs) and bit-lines (BLs), where one cell containing one diode and one RRAM is placed at the intersection of a WL and a BL, as illustrated in Figure 6.4. The array itself can be fully reused across the four modes of operation, i.e., implementing arbitrary logic functions (heavy-weight compute mode), the TCAM function (light-weight compute mode), memory and routing. Previously, an 8Mb multi-layered crossbar array using the TaOx-based 1D1R cell has been demonstrated by Panasonic [65], which presents the details of fabricating the RRAM crossbar.
The connection node is used for connecting WLs (BLs) of two adjacent crossbar arrays, restoring small analog signals to full-swing digital signals for noise tolerance, and supporting the operations of the four configuration modes. The key building blocks in one connection node are sense amplifiers (S/As), a skippable flip-flop, a skippable inverter, a voltage driver and configuration memories, as illustrated in Figure 6.4. Different from the large and power-hungry conventional current or voltage S/As, a compact and low-power RC-based S/A (adapted from the design in [81]) is employed to improve noise tolerance. It detects the small analog signal changes on a WL (or a BL) in one array and generates a full-swing digital output to drive the corresponding WL (or BL) in the adjacent array, and vice versa. The skippable flip-flop is used to implement sequential circuits. The skippable inverter is included for logical completeness. The voltage driver can be configured to generate different drive voltages for the four modes, controlled by the configuration memories. The configuration memories also control the dataflow direction of the connection node, which is designed with two copies of circuits to operate bidirectionally. We note that connection nodes can also be disabled when not in use to save power. More details of the circuit design are presented in Section 6.2.4.

Figure 6.4: A conceptual diagram illustrating the Liquid Silicon architecture. 2 × 2 tiles are drawn in the example. In one tile, the 1D1R-based crossbar array is stacked atop connection nodes (CMOS circuits) and does not consume die area. The key building blocks of one connection node are also drawn in the figure.

6.2.2

Configuration Modes

Light-weight Compute Mode
One tile in this mode can be configured as an embedded TCAM block that can be
used to implement high-performance parallel search or binarized network [28]. The idea is
inspired by the fact that a TCAM array based on non-volatile memory [81] has the same
physical design as a memory array and thus the crossbar array can be reused to implement
the TCAM function as a dedicated light-weight configuration mode.
Figure 6.5 provides an example of the parallel search operation. Two adjacent RRAM cells are paired to store one bit of a word entry. Specifically, these two cells are programmed into complementary states to represent logic 1 or 0, and both of them are programmed into the high resistance state to represent X, which matches any input. The search key (101 in the drawn example) is then applied on the WLs (or BLs), where one bit of the search key is applied on two adjacent WLs (or BLs). This search key is compared in parallel with every data entry stored in the crossbar array. Finally, the S/As output a match vector in which a logic 0 indicates a mismatch between the search inputs and the stored entry, while a logic 1 indicates a match. The match vector can be further fed to the adjacent tiles for post-processing, such as priority encoding.

Figure 6.5: One tile in the light-weight compute mode supports the parallel search operation. The data entries can be stored either row-wise (left) or column-wise (right). The matched entry is highlighted in blue.
[Example in the figure: the stored entries are 100, 110, 001, 101, 000 and 11X; the search key 101 matches only the entry 101.]
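
The parallel search can be modeled behaviorally in a few lines. This sketch reproduces the example of Figure 6.5 at the logic level only; it does not model the paired RRAM cells or the sensing circuit, and the function name `tcam_search` is hypothetical.

```python
def tcam_search(entries, key):
    """Return the match vector: 1 where every non-X bit of an entry equals the key bit."""
    return [
        int(all(e in ("X", k) for e, k in zip(entry, key)))
        for entry in entries
    ]


entries = ["100", "110", "001", "101", "000", "11X"]
print(tcam_search(entries, "101"))  # [0, 0, 0, 1, 0, 0]
```

In hardware all entries are compared at once; the list comprehension only stands in for that parallelism.
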
The TCAM implementation in Liquid Silicon provides three advantages over other standalone TCAM designs [81][85][19]. 1) It can implement TCAM blocks with different sizes and/or aspect ratios by coalescing adjacent tiles. 2) It allows data to be stored in either a row-wise or a column-wise manner. 3) It allows users to flexibly define their custom post-match functions, such as a priority encoder or a population counter, by configuring adjacent tiles into heavy-weight compute mode.
This mode also provides a native implementation of binarized neural networks (BNNs). As illustrated in Figure 6.6, the BNN computation is equivalent to a TCAM function. Specifically, the binary weights are stored in the crossbar arrays (like the stored data entries in TCAM), while the input vector is applied on the WLs (or BLs) of the tile (like the search inputs). The S/A implements the equivalent of the count, normalization and activation functions in the neural network.
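
The correspondence with the TCAM function can be sketched with the weights and input shown in Figure 6.6. The sketch assumes that one connection node outputs 1 when at least `threshold` weight bits equal the corresponding input bits, as annotated in the figure; the function name and the use of a single threshold for all rows are simplifications.

```python
def bnn_layer(weights, inputs, threshold):
    """Output 1 for each weight row with at least `threshold` bits equal to the input."""
    return [
        int(sum(w == x for w, x in zip(row, inputs)) >= threshold)
        for row in weights
    ]


weights = ["100", "110", "001", "101", "000", "111"]
print(bnn_layer(weights, "101", threshold=2))  # [1, 0, 1, 1, 0, 1]
```

Counting matching bits and comparing against a threshold plays the role of the count, normalization and activation steps performed by the S/A; an exact TCAM match is the special case where the threshold equals the word length.
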
Figure 6.6: The light-weight compute mode is also used to implement the binarized neural network, and the data layout in one tile can be either horizontal or vertical.
[Example in the figure: the stored weights are 100, 110, 001, 101, 000 and 111 and the input vector is 101; a connection node configured to output 1 when at least 2 bits of the weight equal the input vector produces the results 1, 0, 1, 1, 0, 1.]

Heavy-weight Compute and Interconnect Mode.
The heavy-weight compute mode and the interconnect mode are described together due to the similarities between them. In the Berkeley Logic Interchange Format (BLIF) [10], an AND logic function can be represented by a combination of 0, 1 and -. For example, the function F = ABC is represented as 111, while F = ĀB̄C̄ (equivalently, the complement of A + B + C) is represented as 000. Unused inputs are masked with -. For instance, when the inputs are A, B, C and D, the function F = ĀBC̄ can be represented as 010-. We observe that this representation is fully compatible with the TCAM function, i.e., we can use the three states in TCAM (0, 1, X) to represent the three states (0, 1, -) in BLIF to implement AND logic functions, and use search keys to represent the logic inputs.
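
The compatibility between a BLIF cube and a TCAM entry can be checked exhaustively for the 010- example above. This is a behavioral sketch with hypothetical helper names; it maps - to X and compares the TCAM match against the Boolean function evaluated directly.

```python
from itertools import product


def cube_to_tcam_entry(cube):
    """Map the BLIF don't-care symbol '-' to the TCAM wildcard 'X'."""
    return cube.replace("-", "X")


def tcam_match(entry, key):
    """1 if every non-X bit of the entry equals the corresponding key bit."""
    return int(all(e in ("X", k) for e, k in zip(entry, key)))


entry = cube_to_tcam_entry("010-")   # F = (not A) and B and (not C); D is unused
for a, b, c, d in product((0, 1), repeat=4):
    key = f"{a}{b}{c}{d}"            # the logic inputs act as the search key
    expected = int((not a) and b and (not c))
    assert tcam_match(entry, key) == expected
```

The loop covers all 16 input combinations, so the stored entry 010X behaves exactly like the AND function for every value of the masked input D.
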
Based on this observation, the light-weight compute mode can be extended to the heavy-weight compute mode, i.e., one tile can implement arbitrary combinational logic functions in situ in the crossbar array without using ALUs. Sequential circuits can be implemented by configuring the skippable flip-flop in the connection node. Figure 6.7 provides an example showing how to map a group of combinational logic functions onto one tile. Overall, there are three key features of such a crossbar-based logic implementation: 1) Each data entry of a tile can implement a multi-input, single-output logic function (up to 256 inputs). 2) The logic inputs are shared among the logic functions in the same tile. 3) The data-flow direction can be controlled at the fine granularity of the entry level (instead of the tile level). The inputs and outputs can be applied on either the top/bottom or left/right side of a tile.

Figure 6.7: The operation of the heavy-weight compute mode is illustrated on the left, where four logic functions are packed and mapped onto one tile. The operation of the interconnect mode is illustrated in the middle. These two modes can coexist in the same tile (right).
The interconnect mode can be treated as a special case of heavy-weight compute mode,
in which one data entry implements a buffer function (F = A). As shown in Figure 6.7,
one input can be routed to any of the other three directions in a tile by programming the
crossbar array and the connection nodes accordingly. Additionally, this mode is able to
co-exist with the heavy-weight compute mode in the same tile (Figure 6.7).
Memory Mode.
In this mode, four adjacent tiles are used to implement a single-port memory block. As illustrated in Figure 6.8, one tile is configured as a memory array and a small fraction of this tile (∼4.7%) is used to implement the column address decoding logic. The remaining three tiles are configured in the heavy-weight compute mode to implement the row address decoder and the read/write column select logic. It might appear that using four tiles to implement one memory block is inefficient. However, due to the ultra-dense array organization and the much simplified pitch matching between the array and the periphery, the overhead is negligible.
Moreover, since all peripheral circuits are implemented in a soft-logic style (instead of ASIC), users can flexibly adjust the logical aspect ratio of the memory array. Our custom compilation tools support logical aspect ratios varying from 4b×15616 to 32b×1952 (61kb in total) by default, and more logical aspect ratios can be realized by configuring the peripheral circuits.

Figure 6.8: An example illustrating (a) the read operation and (b) the write operation in the memory mode. This memory block stores four 2-bit words.
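
The two default endpoints quoted above describe the same capacity: 4 × 15616 = 32 × 1952 = 62464 bits, which is 61 × 1024 bits. The short check below enumerates the aspect ratios for that capacity, assuming the intermediate word widths are the powers of two between 4 and 32 (the text only states the two endpoints); the function name is hypothetical.

```python
TOTAL_BITS = 4 * 15616  # 62464 bits, i.e. 61 * 1024


def aspect_ratios(total_bits, widths):
    """(word width, depth) pairs whose product is exactly total_bits."""
    return [(w, total_bits // w) for w in widths if total_bits % w == 0]


print(aspect_ratios(TOTAL_BITS, (4, 8, 16, 32)))
# [(4, 15616), (8, 7808), (16, 3904), (32, 1952)]
```
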
The read operation is performed in several steps, as illustrated in Figure 6.8a. First, the
row address is decoded by the row address decoding logic in the right tile, and the outputs
are sent to the memory array (central tile). Meanwhile, the column address is also sent to the
column address decoding logic in the memory array in the same direction as the decoded
row address. After sensing is performed by the connection nodes (highlighted in blue), the
read column select logic (NOR gates in the top tile) is applied to the outputs of the memory
array to generate the final read results. Note that the write column select logic in the bottom
tile is disabled during the read operation.
The write operation is performed in two consecutive steps to write logic 1 and logic 0,
controlled by the write high (WH) signal. In the write operation, the row address decoding logic
and the write column select logic generate the appropriate drive voltages, based on the address,
WH, and the write data. Note that during write, the read column select logic and column
address decode logic are disabled (Figure 6.8b).
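The two-step write can be pictured with a purely behavioral Python sketch. It models only the logical effect (bits equal to 1 are written in the WH = 1 step, bits equal to 0 in the WH = 0 step) and none of the drive voltages. The class name, the order of the two steps, and the bit numbering are assumptions made for illustration.

```python
class TwoStepMemory:
    """Behavioral model of a small memory block written in two WH-controlled steps."""

    def __init__(self, words: int, width: int):
        self.width = width
        self.cells = [[0] * width for _ in range(words)]

    def write(self, addr: int, data: int) -> None:
        # Step with WH = 1 updates only the bits that must become 1;
        # step with WH = 0 updates only the bits that must become 0.
        # The order (1 first, then 0) is an assumption.
        for wh in (1, 0):
            for col in range(self.width):
                if (data >> col) & 1 == wh:
                    self.cells[addr][col] = wh

    def read(self, addr: int) -> int:
        return sum(bit << col for col, bit in enumerate(self.cells[addr]))


# Same size as the example in Figure 6.8: four 2-bit words.
mem = TwoStepMemory(words=4, width=2)
mem.write(2, 0b10)
print(mem.read(2))
```

Splitting the write by data value means each step only needs one drive polarity, which is consistent with WH selecting between the two steps.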

6.2.3 Comparison With FPGAs

Liquid Silicon shares some similarities with FPGAs in its reconfigurable data-flow architecture, but it also radically differs from FPGAs by providing the following features.
Hardware support for light-weight computation. Liquid Silicon provides native TCAM hardware support by virtue of a dedicated light-weight compute mode, whereas
FPGAs need to consume scarce on-chip memory resources to emulate the equivalent TCAM
function. Moreover, this mode also provides an efficient implementation of binarized neural networks, which would otherwise require costly hardware resources (including both logic
and on-chip memory) on an FPGA. Therefore, search-/data-intensive
applications can be performed more efficiently on Liquid Silicon than on FPGAs.
Flexible memory blocks. Liquid Silicon provides flexible memory support to better
customize the capacity and location of on-chip memory, depending on workloads. For
instance, one can configure more tiles into the memory mode to achieve high capacity and
place them in close proximity to compute units to better exploit data locality and save power. In
contrast, memory blocks are hard-wired resources in FPGAs, and thus their capacity and
location cannot be changed after manufacturing.
Coarse-grained logic implementation. As presented in the heavy-weight compute
mode (Section 6.2.2), tiles in Liquid Silicon support logic functions with a large number of
inputs. To improve tile utilization (the percentage of resources of a tile used for mapping),
our compilation optimization takes advantage of this architectural feature and employs
a coarse-grained logic implementation, i.e., applications are synthesized into complex logic
functions with larger granularity (∼ 30 inputs). In contrast, logic implementation in
an FPGA is fine-grained: applications are synthesized into a netlist of simple logic
gates (⩽ 6 inputs), and these gates are mapped onto 6-input lookup tables (LUTs).
Compared with FPGAs, the coarse-grained logic implementation in Liquid Silicon
results in shallower logic depth, less routing pressure (Figure 6.3b), better tile utilization,
and thus higher performance and energy efficiency.
Fully exploit RRAM technology. Enabled by the nonvolatile nature of RRAM
technology, Liquid Silicon does not need to load bitstreams from external memory when it
is powered on, thereby reducing the configuration time and power. The nonvolatile nature
also improves its security, as the bitstreams are stored internally. This eliminates the security
vulnerability caused by the external bitstream loading process in FPGAs, which creates an
easily exploitable, non-invasive conduit by which the FPGA’s IP can be captured and copied.
Efficient resource partitioning. In Liquid Silicon, the hardware resources can be
flexibly partitioned by the compilation framework between logic and interconnect based on
the actual usage (Figure 6.3a), leading to better resource utilization than FPGA.

6.2.4 Circuit Implementation

Crossbar Array
In Liquid Silicon, each tile contains one crossbar array, which is built upon 1D1R cells
comprising an RRAM device and an access device placed at the intersection of a bit-line (BL)
and a word-line (WL), as illustrated in Figure 6.4. When one tile is configured to implement
one of the four modes (Section 6.2.2), the RRAM cells in the crossbar are programmed into
appropriate resistance states (LRS or HRS), and the BLs (or WLs) in the crossbar are
either driven or sensed by the connection nodes.
Proper voltages are applied on BLs (or WLs) to eliminate the sneak path leakage during
computation. More specifically, when one BL (or WL) is driven by the connection node, it
can have two voltage levels, i.e., Vinput0 and Vinput1 to represent logic 0 and 1, respectively.
When it is sensed by the connection node, the voltage on this line is between Vprecharge and
Vdischarge based on the sensing scheme (Section 6.2.4-Connection Node). To eliminate
the sneak path leakage, these voltages need to satisfy the following requirements³.

(1) $V_{input1} - V_{input0} < V_T$

(2) $V_{precharge} - V_{discharge} < V_T$

(3) $V_{precharge} - V_{input0} > V_T$

(4) $V_{precharge} - V_{input1} < V_T$

Satisfying the requirement (1) means that when one BL and one WL are both driven by
the connection nodes, the access device (diode) at the intersection is turned off, therefore
no direct path is formed between two voltage sources (no static current). Satisfying the
requirement (2) indicates that when one BL and one WL are both sensed by the connection
nodes, the access device at the intersection is turned off, thereby disconnecting these two
lines. Requirements (3)-(4) are given by the sensing scheme, which will be discussed in
Section 6.2.4-Connection Node.
Based on these requirements, the voltages used in Liquid Silicon are Vinput0 = −0.3V,
Vinput1 = 0V, Vprecharge = 0.5V, Vdischarge = 0.3V and VT = 0.6V.
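The chosen operating point can be checked against requirements (1)-(4) with a few lines of Python. The dictionary keys and the function name are illustrative only; the voltage values are the ones stated above.

```python
# Voltages stated in the text (in volts).
V = {
    "input0": -0.3,
    "input1": 0.0,
    "precharge": 0.5,
    "discharge": 0.3,
    "T": 0.6,  # turn-on voltage of the access device
}


def check_requirements(v: dict) -> dict:
    """Evaluate requirements (1)-(4) for a given set of voltages."""
    return {
        1: v["input1"] - v["input0"] < v["T"],        # two driven lines: device off
        2: v["precharge"] - v["discharge"] < v["T"],  # two sensed lines: device off
        3: v["precharge"] - v["input0"] > v["T"],     # input 0 can discharge the line
        4: v["precharge"] - v["input1"] < v["T"],     # input 1 stays isolated
    }


results = check_requirements(V)
print(results)
```

With these values the four differences are 0.3V, 0.2V, 0.8V and 0.5V, so only requirement (3) exceeds VT, as intended.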

Connection Node
The circuit implementation of one connection node is depicted in Figure 6.9. The
key building blocks are 1) sense amplifier (S/A) for sensing the voltage changes on the
connected BL (or WL), 2) configurable dynamic inverter to assist in implementing the OR
gates in the Sum-Of-Product (SOP) terms, 3) flip-flops for implementing sequential circuits,
4) voltage driver to generate the required voltages for different modes, and 5) RRAM-based
configuration memories to control the various operations of the connection node.

3 VT is the turn-on voltage for the access device, and Vdd = 1V under 45nm technology.

Figure 6.9: Detailed implementation of one connection node.
Building Block - Sense Amplifier
The circuit implementation of the S/A is presented in Figure 6.10, which contains three
parts: 1) precharging circuit (P1 and the transmission gate), 2) discharging circuit (N1
and N2), and 3) inverters to generate a full swing output. To illustrate the sensing scheme
used in this S/A design, we assume that the data entry is stored on BLs, and inputs are
applied on WLs by driving them to Vinput0 or Vinput1 . Note that the same sensing circuit
and scheme can still be applied when the data entry is stored on WLs.
The sensing operation is controlled by the sensing clock and is performed in two stages:
precharge and evaluation. In the precharge stage, the BL is charged to Vprecharge through
the transmission gate, and the node SN is precharged to Vdd through P1. Then, in the
evaluation stage, the BL is floating and starts to discharge (Figure 6.11b) at a rate depending
on the number of WLs that are pulled down to Vinput0 (requirement 3) and the resistance
states of RRAMs on the BL (Figure 6.11a). Note that no discharging current flows through
the WLs that are connected to Vinput1 (requirement 4) due to the isolation of the access device,
regardless of the resistance states of RRAMs. At the same time, node SN is also discharging
to ground since N1 and N2 are turned on (Figure 6.11c), and the discharging rate is controlled
by the gate voltage of N2, i.e., the voltage of the BL. In a match case (BL1 in Figure 6.11),
SN has a higher discharging rate and the output switches to logic 1 in a shorter time,
compared to the mismatch case (BL0 in Figure 6.11d).

Figure 6.10: The implementation of the S/A design.

Figure 6.11: (a) The voltages on WLs and the RRAM states are presented. The corresponding discharge currents for these two BLs are also drawn. (b) The voltages on these two
BLs. (c) The voltage on the node SN in the S/A. (d) The output of S/A. (e) The output
of the configurable dynamic inverter, and (f) this output is latched by the reference timing
signal.

Figure 6.12: Circuit design of the configurable dynamic inverter.
Building Block - Configurable Dynamic Inverter
The configurable dynamic inverters are included to improve the mapping quality. As
discussed in Section 6.3, applications are synthesized into SOP terms in the technology
mapping stage, where each SOP term comprises a group of AND gates and one OR gate.
While it is efficient to map the group of AND gates onto tiles, it leads to a low utilization
when mapping the OR gates onto tiles. More specifically, these OR gates do not share their
inputs with other logic gates, and when mapping them onto tiles, the WLs (or BLs) they
occupy cannot be utilized by other gates mapped in the same tile. This reduces the number
of logic gates that can be mapped onto one tile and degrades the mapping result (e.g.,
area). Using configurable dynamic inverters to implement these OR gates can improve the
mapping quality.
The circuit implementation of a configurable dynamic inverter is shown in Figure 6.12,
which contains one dynamic inverter and one NMOS. The dynamic inverter is controlled
by the same sensing clock that is applied to the S/A, and its operation is illustrated in
Figure 6.11e. The NMOS can be configured to connect the adjacent connection nodes,
so that a multi-input dynamic NOR gate can be formed among adjacent nodes.
Building Block - Flip-flops
As shown in Figure 6.9, one connection node contains two flip-flops for one data-flow
direction. One flip-flop is conditionally included to implement the sequential circuit. The
other one (marked in grey in Figure 6.9) is included to latch the output (Figure 6.11f) of
the configurable dynamic inverter, which is controlled by the reference timing signal. This
reference timing signal is locally generated by one reserved entry (WL or BL) in every tile.
This allows each tile to have its own reference timing control, so it works for all
configuration modes without any modification.
Building Block - Driver Circuits
Driver circuits are included to generate the correct drive voltages based on the input
signal and the sensing clock. Controlled by one configuration memory, they can also be
disabled to save power. Two types of driver circuits are used in the connection node.
Driver 1 contains a negative voltage level shifter to generate Vinput0, and Driver 2
extends Driver 1 by adding a positive voltage level shifter to provide the write voltages
for the memory mode. Since the row address decode logic and write column select logic
(Figure 6.8) can only be placed on the right and bottom sides of the memory array, only one
data-flow direction needs to use the large driver circuits (Driver 2), as shown in Figure 6.9.
Building Block - Configuration Memory
In Liquid Silicon, we also use RRAM devices to build non-volatile configuration memory
in the connection node. Each configuration memory is structured with a 3D2R cell, two
inverters, and two MOSFETs (Figure 6.13a) and can be organized as a crossbar array
(Figure 6.13b) by connecting the WLs (WLT and WLB), BLs and LOAD lines. Information
is stored in each configuration memory cell using two RRAM devices, which are programmed
to have complementary states (one in HRS and the other in LRS). For example, the RRAM
device in blue (Figure 6.13a) is programmed into LRS to store logic 1; otherwise, it stores
logic 0.

Figure 6.13: (a) Circuit implementation of the non-volatile configuration memory, with the
voltage setups for three operations highlighted. (b) 3D2R cells can be organized in a
crossbar structure, and the voltage setup to program one RRAM cell (in blue) is illustrated.
To program the RRAM device in configuration memory, the “V/3” write scheme [20] is
used, and one example (Figure 6.13b) is given to illustrate the applied voltages for writing
one RRAM device (blue).
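To make the “V/3” scheme concrete, the sketch below computes the voltage across every cell of a small crossbar in the commonly used form of the scheme: the selected WL is driven to V, the selected BL to ground, unselected WLs to V/3 and unselected BLs to 2V/3. The assignment of these levels to WLs versus BLs, the array size, and the function name are assumptions for illustration and are not read off Figure 6.13b.

```python
def v3_cell_voltages(rows: int, cols: int, sel_row: int, sel_col: int, v_write: float):
    """Return the WL-to-BL voltage across each cell under a V/3 write scheme."""
    wl = [v_write if r == sel_row else v_write / 3 for r in range(rows)]
    bl = [0.0 if c == sel_col else 2 * v_write / 3 for c in range(cols)]
    return [[wl[r] - bl[c] for c in range(cols)] for r in range(rows)]


# Hypothetical 3x4 array, writing the cell at row 1, column 2 with V = 3.0.
grid = v3_cell_voltages(rows=3, cols=4, sel_row=1, sel_col=2, v_write=3.0)
for row in grid:
    print(row)
```

Only the selected cell sees the full write voltage; every other cell sees a magnitude of V/3, which is what keeps unselected cells from being disturbed.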
In the read operation, W LT is connected to Vdd , and W LB is connected to gnd. A
voltage divider is formed between two WLs, and the voltage on Node A (Figure 6.13a)
is determined by the resistance states of the two RRAM devices. The degraded voltage
level on Node A due to the limited resistance ratio is restored to a full voltage swing by
the inverters to generate final digital outputs. In addition, Vdd /2 is applied on BL to turn
off the access device in red (Figure 6.13a), thereby disconnecting 3D2R cells from BL and
isolating them from each other. Finally, the LOAD line is connected to Vdd , and the stored
configuration bit is loaded into the storage node (Node S).
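The voltage divider formed during the read operation can be illustrated numerically. In the sketch below, the two RRAM devices are treated as ideal resistors in series between WLT (Vdd) and WLB (gnd), and the inverters are idealized as a comparison against Vdd/2. The resistance values, the Vdd/2 threshold, and the mapping from divider polarity to the output bit are all assumptions for illustration; only Vdd = 1V comes from the text.

```python
VDD = 1.0        # from the footnote on the technology setup
R_LRS = 10e3     # hypothetical low-resistance value (ohms)
R_HRS = 100e3    # hypothetical high-resistance value (ohms)


def node_a_voltage(vdd: float, r_top: float, r_bottom: float) -> float:
    """Voltage on Node A for two series resistors between vdd (top) and gnd (bottom)."""
    return vdd * r_bottom / (r_top + r_bottom)


def restored_bit(r_top: float, r_bottom: float) -> int:
    """Idealized inverter stage: restore the degraded level to a full-swing bit."""
    return 1 if node_a_voltage(VDD, r_top, r_bottom) > VDD / 2 else 0


print(node_a_voltage(VDD, R_LRS, R_HRS), restored_bit(R_LRS, R_HRS))
print(node_a_voltage(VDD, R_HRS, R_LRS), restored_bit(R_HRS, R_LRS))
```

Even with a resistance ratio of only 10 in this hypothetical setting, Node A sits near 0.91V or 0.09V, which leaves a wide margin for the inverters to restore a full swing.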
During normal operations, WLs, BL and LOAD line are all connected to gnd (Figure 6.13a). Therefore, 3D2R cells have zero standby power consumption, and configuration
bits are retained without the need of an external power supply.

Figure 6.14: An example illustrating sensing operations when providing (a) one sensing
clock or (b) two sensing clocks.
Power-Saving Techniques
The S/A is the most power-consuming part of the connection node because it needs to
frequently precharge the BL or WL (large capacitance). To reduce the power, we apply two
techniques when designing the S/A. The first technique is to reduce the precharging
voltage, i.e., using Vprecharge instead of Vdd. In addition, instead of a rail-to-rail voltage
swing, the voltage on the BL or WL swings only between Vdischarge and
Vprecharge, a difference of 0.2V in this design. This technique reduces the power consumption of
one precharging operation.
The other technique is that, instead of only having one sensing clock, multiple sensing
clocks that have the same frequency but different phases are provided, and each S/A chooses
one of them to perform the sensing operation. This can reduce the operating frequency
of the S/A, thereby reducing the power consumption, as illustrated in Figure 6.14. More
specifically, one critical path of the mapped application can have n connection nodes. If
only one sensing clock is provided, then in order to run the application at frequency
f, the sensing clock frequency needs to be nf. In contrast, if two sensing clocks are
provided, then the sensing clock frequency can be reduced to nf/2. More sensing clocks
lead to lower power consumption, but they require more multiplexers and configuration
memories for selecting clocks (large area). In this design, we choose to include two sensing
clocks, which reduces the power consumption of Liquid Silicon without noticeably increasing
the design complexity.

Figure 6.15: Physical design of a tile under 40nm technology.
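The relation between application frequency and sensing clock frequency in the multi-clock technique can be written as a one-line helper. The text gives the cases of one clock (nf) and two clocks (nf/2); extending the same pattern to k clocks as nf/k is an assumption, as are the function name and the example numbers.

```python
def sensing_clock_frequency(app_freq_hz: float, nodes_on_critical_path: int,
                            num_sensing_clocks: int = 1) -> float:
    """Sensing clock frequency needed to run the application at app_freq_hz.

    Cases k = 1 and k = 2 follow the text; other k values are an assumed
    generalization of the same pattern.
    """
    if num_sensing_clocks < 1:
        raise ValueError("at least one sensing clock is required")
    return nodes_on_critical_path * app_freq_hz / num_sensing_clocks


# Hypothetical example: 4 connection nodes on the critical path, 100 MHz target.
print(sensing_clock_frequency(100e6, 4, 1))
print(sensing_clock_frequency(100e6, 4, 2))
```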
Physical Design
As described in Section 6.1.1, the fabrication process of the RRAM and access device is
BEOL compatible. Therefore, the crossbar arrays are implemented on upper-level metals,
i.e., M3 and M4 layers, while the connection nodes use lower-level metals, i.e., M1 and
M2 layers, for local routing and are buried underneath the arrays. We perform layout
optimization by judiciously increasing the pitch size of the crossbar array, which eases
the placement of connection nodes fully below the array to achieve the minimal Si area.
Additionally, we share common circuits (e.g. S/A and voltage driver) as much as possible
between tiles to further reduce the area. As shown in Figure 6.15, we complete the physical
design using 40nm CMOS technology. The area of one tile is measured to be 87.87µm ×
87.24µm. The physical design information will be used for the evaluation in Section 6.6.1.
Configuration and Other Issues
In Section 6.2.4-Connection Node, we discussed the methods to program the configuration
memory in the connection nodes. In this section, we present the method to program the crossbar
array.
In Liquid Silicon, buffers are inserted between crossbar arrays, which can be configured
to connect or disconnect multiple WLs (or BLs) in the adjacent tiles when programming
Liquid Silicon. Logically, Liquid Silicon only contains one crossbar array during configuration, and it can be programmed by a widely used write scheme (e.g., the “V/3” or “V/2” write
scheme) [20]. Additionally, write-and-verify schemes can be applied to program RRAM cells
into the required resistance level. Physically, WLs and BLs are separated by the buffers;
therefore, the configuration process does not suffer from a severe IR-drop issue. The configuration
bitstream is generated by the compilation tool (Section 6.3) in an offline process.
With the manufacturing process maturing, there has been steady improvement
in RRAM technologies based on engineering approaches. Rather than taking over the
dominant markets of incumbent technologies such as DRAM or FLASH, a relatively low-hanging fruit is to apply RRAM to Liquid Silicon, as it has less stringent requirements on
endurance, write speed, power, etc., from the technology point of view. For RRAMs of
interest, 10^8 cycles of endurance is likely sufficient to sustain the lifetime of Liquid Silicon,
which is quite achievable in commercial products. For the memory mode, since any tile can
be configured as a memory block, wear leveling can be performed by optimally placing the
memory blocks, thereby reducing the pressure on endurance. Evaluation of this technique
will be our future work. High write power and low write speed per bit compared to SRAMs
are less of a concern, as configuration is not done as frequently as updates on main memory
and not all applications utilize the memory mode.

6.3 Custom Compilation Framework

Each Liquid Silicon contains hundreds of thousands of tiles with Gb-scale RRAM that
needs to be configured to run an application. Therefore, it is intractable to custom-tailor
each memory element for application mapping. To address this issue, a custom compilation
framework is presented to facilitate application development for Liquid Silicon.
Figure 6.16 depicts the compilation framework for Liquid Silicon. It comprises a flexible
front-end that supports a wide range of popular high-level programming languages and
frameworks, and a custom back-end that can fully exploit the low-level architectural features
of Liquid Silicon to achieve optimal code mapping on target hardware. More specifically, the
front-end takes an application written in high-level programming languages/frameworks as
an input and translates it into synthesizable Verilog RTL code. Any front-end that generates
a common code representation in Verilog RTL can be integrated into this framework. Thus,
this compilation framework is reusable and extendable to other front-ends.

Figure 6.16: Workflow of the compilation framework. The back-end is modified to support
the features provided in Liquid Silicon.
The back-end further synthesizes the Verilog RTL code into bitstreams, which are used
to configure Liquid Silicon. This custom back-end is adapted from one of the most popular
open-source retargetable toolchains, Verilog to Routing (VTR) [87][96], which has been developed for mapping applications written in Verilog onto FPGAs. Nevertheless, several major
modifications are made to VTR to account for the fundamental architectural differences between FPGAs and Liquid Silicon for optimal code mapping. As shown in Figure 6.16, the
back-end contains three stages, i.e., parser, technology mapping and place&route. Specifically, the parser from VTR is modified to support the four configurations. The technology
mapping tool from VTR is modified to realize the coarse-grained logic implementation.
The place&route tool in VTR is replaced by a custom tool that fully utilizes the unique flexibility provided by Liquid Silicon. The key modifications are highlighted in the following
discussion.
Compiler support for light-weight compute mode. As VTR is originally developed for FPGAs, which do not have dedicated configuration support for light-weight
computation (Figure 6.5), we modify it to add compiler support for such a new feature in
Liquid Silicon. More specifically, we add two new Verilog modules for TCAM and BNN
that can be instantiated in the Verilog RTL code. The parser is modified to identify the
instantiations of these modules and convert them into a logical netlist. If the size of
a module exceeds a predefined threshold, the parser splits it into multiple smaller
ones to ensure that it is physically realizable. The place&route tool then takes the netlist as an
input and maps it onto physical tiles.
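The splitting step can be sketched as follows. The text does not say how module size is measured or how the pieces are chosen, so this example assumes size is a number of entries (e.g., TCAM rows) and that the parser fills pieces greedily up to the threshold; the function name and the numbers are hypothetical.

```python
def split_module(num_entries: int, max_entries: int) -> list[int]:
    """Split a module into pieces no larger than max_entries (greedy, assumed policy)."""
    if num_entries <= 0 or max_entries <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(num_entries, max_entries)
    return [max_entries] * full + ([rest] if rest else [])


# Hypothetical example: a 100-entry module with a 32-entry threshold.
print(split_module(100, 32))
```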
Compiler support for flexible memory blocks. Although VTR provides compiler
support for mapping memory modules on FPGA, the mapping is subject to the constraint of
physical hardware IP blocks with fixed size and location. Here, we modify it to better exploit
the flexibility of Liquid Silicon. Specifically, an additional parsing step, which is needed in
VTR to split a large memory module into smaller ones in order to fit into physical BRAM
blocks with fixed size and location, is no longer needed in the case of Liquid Silicon, whose
architecture naturally supports flexible size and location.
Compiler support for coarse-grained logic implementation. VTR has a fine-grained logic implementation in the technology mapping stage, which synthesizes the logic
netlist into simple logic gates with ⩽ 6 inputs. To adapt VTR to Liquid Silicon, we modify
the technology mapping tool and use cut enumeration with the priority cuts algorithm [94] to pack the simple logic gates into large clusters (e.g., complex logic functions
with ∼ 30 inputs), thereby improving the tile utilization.
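For readers unfamiliar with cut enumeration, the sketch below shows the basic idea on a toy netlist: the cuts of a node are obtained by merging one cut from each fanin, discarding merged cuts with more than k leaves, and keeping only a bounded number of cuts per node. This is a simplified illustration and not the modified VTR implementation; in particular, ranking cuts by leaf count is a placeholder for the priority criterion, and the netlist, names and parameter values are hypothetical.

```python
from itertools import product


def enumerate_cuts(fanins: dict, k: int, max_cuts: int) -> dict:
    """Enumerate k-feasible cuts per node, keeping at most max_cuts non-trivial cuts.

    fanins maps each node to a tuple of its fanin nodes and must be listed in
    topological order (primary inputs map to an empty tuple).
    """
    cuts = {}
    for node, ins in fanins.items():
        trivial = frozenset([node])
        candidates = set()
        if ins:
            # Merge one cut from every fanin; keep the result if it has <= k leaves.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    candidates.add(merged)
        # Placeholder priority: fewer leaves first, ties broken by leaf names.
        ranked = sorted(candidates, key=lambda c: (len(c), sorted(c)))
        cuts[node] = [trivial] + ranked[:max_cuts]
    return cuts


# Toy netlist: n3 = AND(AND(a, b), AND(c, d)).
fanins = {
    "a": (), "b": (), "c": (), "d": (),
    "n1": ("a", "b"),
    "n2": ("c", "d"),
    "n3": ("n1", "n2"),
}
cuts = enumerate_cuts(fanins, k=4, max_cuts=8)
print(cuts["n3"])
```

Raising k lets a single cut absorb more of the logic cone, which is the mechanism behind packing simple gates into larger functions.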
Compiler support for adaptive resource partitioning. VTR assumes dedicated
hardware (CBs and SBs in Figure 2.1) for routing, while in Liquid Silicon, routing is flexible
and can co-exist with the heavy-weight compute mode within a tile. As such, the custom
place&route tool exploits the unused portion of a tile to perform routing. More specifically, we propose a technique called Adaptive Resource Partition to partition the hardware
resources in one tile between heavy-weight compute mode (logic) and interconnect mode
(routing) to achieve high tile utilization.

6.3.1 Adaptive Resource Partition

The custom place&route tool utilizes the simulated annealing algorithm [127] to place logic
functions into tiles and routes the interconnections between them. In order to adaptively
adjust the resource provisioning between logic and routing for each tile, this custom tool
performs place and route simultaneously, rather than sequentially as in FPGA place&route
tools. Specifically, given a new placement, the routing paths for all interconnection nets are
generated. Then the resource provisioning between logic and routing in each tile is updated
based on the actual usage, and the placement is updated based on the new resource provisioning. As a result, the resource provisioning in each tile can be independently controlled
and adjusted.
New Cost Function: To achieve this adaptive resource partition, a new cost function is designed to account for both place and route. Based on the negotiation idea in
PathFinder [91], the cost function for a given logic primitive i (e.g., a logic function) is
given by:
$$\mathrm{Cost} = (B - A(i)) \times f(i) + B \times C \times g(i)$$
where A(i) is the length of the longest path that contains logic primitive i, and B is the length
of the critical path. The term f(i) represents the cost of placement and is related to the
utilization of the tile in which logic primitive i is placed. More specifically, its value is
chosen to be small, e.g., 0.01, if that tile is not over-utilized. Otherwise, it is a large positive
number. The term g(i) is the routing cost of logic primitive i, and its value is the average
length of routing paths. The value of the parameter C is chosen to be 1 initially, and
decreases if any tile is over-utilized. The total cost is the summation of costs for all logic
primitives in the netlist.
With this new cost function, we can optimize delay on the critical path, as the term
(B − A(i)) becomes zero there and only the routing cost remains. On non-critical paths,
in contrast, the cost function tends to place logic
primitives in tiles with low utilization and thus optimizes area. Therefore, we can achieve
the best trade-off between performance and area.
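The cost function can be evaluated directly, as in the following sketch. The formula and the value 0.01 for a tile that is not over-utilized follow the text; the concrete over-utilization penalty (1000.0), the tuple layout of a primitive, and the example numbers are assumptions made for illustration.

```python
def primitive_cost(a_i: float, b: float, c: float, tile_overused: bool,
                   avg_route_len: float, overuse_penalty: float = 1000.0) -> float:
    """Cost = (B - A(i)) * f(i) + B * C * g(i) for one logic primitive."""
    f_i = overuse_penalty if tile_overused else 0.01  # placement cost term
    g_i = avg_route_len                               # routing cost term
    return (b - a_i) * f_i + b * c * g_i


def total_cost(primitives, b: float, c: float) -> float:
    """Sum the cost over all primitives; each is (A(i), tile_overused, avg_route_len)."""
    return sum(primitive_cost(a_i, b, c, over, g) for a_i, over, g in primitives)


# Hypothetical netlist of three primitives with critical path length B = 10.
prims = [(10, False, 2.0), (6, False, 2.0), (6, True, 2.0)]
print(total_cost(prims, b=10, c=1.0))
```

For the first primitive A(i) equals B, so its cost reduces to the routing term; for the third, the over-utilized tile dominates the cost, which pushes the annealer to move it.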

6.4 Chip Demonstration

A test chip is fabricated to demonstrate the Liquid Silicon architecture. In this test chip,
each tile is structured with a 1T1R memory array (rather than a 1D1R array, to reduce the fabrication
difficulty) and a set of connection nodes, as illustrated in Figure 6.17. The schematic of
the 1T1R memory cell is shown in Figure 6.17b, which contains an access transistor and
an RRAM element. The 1T1R memory array is identical to the array structure in
conventional nonvolatile memory designs, while the array-to-array interconnection is radically
different from that in conventional memory. In particular, the word line (WL)/bit line
(BL)/source line (SL) connections to adjacent tiles are realized via the connection nodes,
which comprise CMOS circuits to support the essential operations of Liquid Silicon. Note
that when making the tile-to-tile connection, the adjacent tile is rotated by 90 degrees to make
the data flow between tiles easier, and each connection node is responsible for connecting
the BLs of the memory array in one tile to the WLs of the memory array in the adjacent
tile and driving the corresponding SLs. Since the adjacent tile is rotated by 90 degrees, one
WL in Liquid Silicon can select either 1) a row of RRAM cells or 2) a column of RRAM
cells, depending on the orientation of the tile. This is different from conventional
memory, where one WL is used to select a row of cells.
Figure 6.17c shows the key building blocks of a connection node, which contains separate
circuits for read (Section 6.4.1) and write operations (Section 6.4.2). The read
circuits include a sense amplifier (SA), an inverter, a flip-flop, a low-voltage driver (LV-driver) and two multiplexers (controlled by two 1-bit configuration memories). The write
circuits include high-voltage drivers (HV-drivers) based on thick-oxide FETs, special registers
(2-bit WL Sel, 1-bit Bit Mask and 2-bit Data) and some decoding logic. The SA, LV-driver
and HV-drivers are connected to the BL/SL/WL by the write enable signal, depending on
the read or write operation. A scan chain (not drawn in the figure for simplicity) is used to
load data into the configuration memories and the special registers.

Figure 6.17: (a) This Liquid Silicon test chip comprises a 2D array of identical tiles, and each
tile contains a 1T1R memory array and several connection nodes. Note that the adjacent
tile is rotated by 90 degrees. The pitch mismatch between WL and BL can be resolved in
the connection node through a two-metal transition routing network. (b) The schematic of
a 1T1R memory cell, and (c) key building blocks of the connection node.

Figure 6.18: (a) The read data path for the sensing operation. (b) A conceptual diagram
illustrating the sensing operation.

6.4.1 Operational Modes

This section first describes the read/sensing operation, which is the most performance-critical and
commonly used operation in all operational modes, and then presents the details of each mode.
The read data path to perform the sensing operation is shown in Figure 6.18a. The
BL is connected to the SA, and the WL and the SL are connected to the LV-driver and
ground respectively. We apply the same circuit as in [81] to implement the SA to achieve
low power, high speed, and good noise tolerance. The sensing operation is controlled by the
sensing clock (CLK) and is performed in two stages: precharge and evaluation, as shown
in Figure 6.18b. In the precharge phase (CLK=0), all WLs are connected to ground to
turn off the access transistors in the 1T1R cells, and the BL is charged through P1. In the
evaluation phase (CLK=1), BL is floating (P1 is off), and WLs are either connected to VDD
or GND, depending on the input values. The BL then starts to discharge to ground at a
rate depending on the number of low-resistance pull-down paths. A large discharging current
occurs if an RRAM element is in the low resistance state (LRS) and the corresponding access
transistor is turned on. If the BL voltage drops across a certain threshold (Vth) before the
reference timing, the SA outputs a ‘0’ (BL0 in Figure 6.18b). Otherwise, it outputs a ‘1’
(BL1 in Figure 6.18b). A redundant BL is reserved in each tile to generate the reference
timing signal to better track process, voltage and temperature (PVT) variations. As a result
of the analog nature of RRAM, the reference timing (the falling edge of the SA output) is
highly tunable as further adjustment can be performed via programming of RRAM cells
on this BL. In the current setup, the reference BL is monitored and programmed by the
external tester through write-verify operations.
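The two-phase sensing scheme can be illustrated with a first-order RC model. The sketch below is only a behavioral approximation, not the SA circuit from [81]: it assumes the BL acts as a single capacitor that discharges through the parallel resistance of the enabled cells, it ignores the access-transistor resistance, and all numeric values (C_BL, VDD, VTH, the cell resistances and the reference resistance) are illustrative assumptions.

```python
import math

# Illustrative values only (assumptions, not measured chip parameters).
R_LRS = 5e3       # ohms, low resistance state
R_HRS = 100e3     # ohms, high resistance state
C_BL = 100e-15    # farads, hypothetical bit-line capacitance
VDD = 1.0         # volts, precharge level
VTH = 0.5         # volts, SA threshold


def crossing_time(wl_on, cell_is_lrs):
    """Time for the floating BL to decay from VDD to VTH during evaluation."""
    conductance = sum(
        1.0 / (R_LRS if lrs else R_HRS)
        for on, lrs in zip(wl_on, cell_is_lrs)
        if on  # only cells whose access transistor is on can conduct
    )
    if conductance == 0.0:
        return math.inf  # no pull-down path: the BL stays at VDD
    return (C_BL / conductance) * math.log(VDD / VTH)


def reference_timing(r_ref):
    """Reference timing from a redundant BL whose cell is programmed to r_ref."""
    return C_BL * r_ref * math.log(VDD / VTH)


def sense(wl_on, cell_is_lrs, t_ref):
    """SA output: 0 if the BL crosses VTH before the reference timing, else 1."""
    return 0 if crossing_time(wl_on, cell_is_lrs) < t_ref else 1


t_ref = reference_timing(15e3)  # hypothetical reference resistance
print(sense([True, False, False], [True, False, False], t_ref))  # one LRS path on
print(sense([True, True, True], [False, False, False], t_ref))   # only HRS paths on
```

In this model, reprogramming the reference cell to a different resistance shifts t_ref, which mirrors how the reference BL is tuned through write-verify operations.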
The same sensing operation is performed across all operational modes. The modes differ in how data is encoded in the inputs applied
on WLs, the resistance values stored in the resistive elements, and the SA’s outputs. Based
on the value stored in the configuration memories, the sensing result (SA’s output) can be
inverted and/or latched before being sent to the next tile. The sensing outputs of a tile
drive the WLs of the memory array in the adjacent tile. Each of the four operational modes
is discussed in detail below.
Heavy-weight Compute and Interconnect Mode
An arbitrary multi-input-single-output logic function is implemented by one BL in this
mode. The inputs to the logic function are applied on the WLs, and a sensing operation is
performed to generate the output of this function on the BL. In this operational mode, logic
input values are encoded into voltage levels with inverted polarity: a logic input ‘0’ is encoded into a voltage level of VDD on a WL, while a logic input
‘1’ is encoded into GND. In addition to the inputs, the logic function itself is encoded
into the corresponding resistance values. For instance, Figure 6.19a shows the implementation of a three-input AND function (Out = ABC). The RRAM elements associated with
the WLs that carry the inputs to this logic function are programmed into LRS, while other
RRAM elements are written into high resistance state (HRS). As shown in Figure 6.19a,
if any of the inputs A, B or C is logic ‘0’ (apply VDD on the corresponding WL), the BL
will have a large discharging current, which will be detected by the SA, outputting a ‘0’.
Otherwise, if all the inputs are logic ‘1’, the SA outputs a ‘1’ (Figure 6.19b). By controlling
the configuration memory, the sensing result can be further inverted to implement a NAND
function on one BL, and it can also be latched to implement a sequential logic.

Figure 6.19: (a, b) An arbitrary AND function is implemented on one row (BL). (c) Multiple
functions are implemented in one array with a compact mapping.

Multiple logic functions can be implemented in one tile, as shown in Figure 6.19c. A
shared logic input (e.g., input B) is applied on the same WL to improve the array utilization.
The unused inputs can be simply masked out by programming the corresponding RRAM
cells into HRS. For instance, the logic functions of Out0 = AB and Out2 = A do not have
input C. Although applying a logic ‘0’ to input C turns on all access transistors
in the corresponding row and discharges all three BLs, BL0 and BL2 have only a small
discharging current because their cells in that row are in HRS, so the corresponding SAs still generate the correct outputs (Out0 = 1
and Out2 = 1). Moreover, one BL can also be used to route a logic signal by implementing
a one-input buffer function (Out2 = A). This is used in the Memory mode to route the
column address.
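The mapping described above can be checked with a small behavioral model. This is a minimal sketch rather than the author's tool flow: it abstracts each 1T1R cell to a boolean (True for LRS), treats any enabled LRS cell as a large discharging current, and uses hypothetical names (wl_on, sense_column, TILE). The tile contents follow the three functions of Figure 6.19c.

```python
def wl_on(logic_inputs):
    """Logic '0' drives the WL to VDD (transistor on); logic '1' drives it to GND."""
    return [bit == 0 for bit in logic_inputs]


def sense_column(wl, column_is_lrs, invert=False):
    """SA output for one BL: 0 if any enabled cell is in LRS, else 1."""
    large_current = any(on and lrs for on, lrs in zip(wl, column_is_lrs))
    out = 0 if large_current else 1
    return 1 - out if invert else out  # optional inversion gives NAND


# Rows are WL0..WL2 carrying A, B, C; True marks a cell programmed into LRS.
# Unused inputs are masked out by leaving the cell in HRS (False).
TILE = {
    "Out0 = AB": [True, True, False],
    "Out1 = BC": [False, True, True],
    "Out2 = A": [True, False, False],
}


def evaluate(a, b, c):
    """Evaluate all three BLs of the tile in one sensing operation."""
    wl = wl_on([a, b, c])
    return {name: sense_column(wl, column) for name, column in TILE.items()}


print(evaluate(1, 1, 0))
```

With A = 1, B = 1 and C = 0, only the BL that implements BC sees an enabled LRS cell, so the model reproduces the outputs discussed in the text.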


Memory Mode
Three adjacent tiles are utilized to implement a memory block (Figure 6.20a). The left
tile operates in the Heavy-weight Compute mode to implement the row address decoder
and route the column address to the central tile. The central tile stores data and implements
the column address decoder. The bottom tile also operates in the Heavy-weight Compute mode to
implement the selection logic and generate the read result. The data and input encoding
in the left/bottom tiles are the same as those in the Heavy-weight Compute mode. For the central tile,
selecting a row for read is encoded into applying a voltage level of VDD on the corresponding
WL, while GND is applied on the WLs of the unselected rows. Data bit ‘1’ is encoded as
programming an RRAM element into HRS, while data bit ‘0’ is encoded as programming
an RRAM element into LRS.
Figure 6.20 provides a conceptual diagram to show the implementation of a memory
block that stores 16 2-bit words. The memory address A[3 : 2] is used as the row address,
while A[1 : 0] is used as the column address. The left tile implements the logic functions to
decode the row address (Figure 6.20b). Based on the decoded results, appropriate voltages
are applied on the WLs of the central tile (Figure 6.20c). The column address is then
decoded by the central tile. As shown in Figure 6.20c, the BL of an unselected column has
a large discharging current flowing through the 1T1R cells in the address decoder region. Thus,
the SA always outputs logic ‘0’ for the unselected column, independent of the data value
stored in the data array region. For the selected column, the sensing result is determined by
the RRAM state in the selected 1T1R cell: the SA outputs logic ‘1’ if this RRAM
is in HRS (stored bit ‘1’), and ‘0’ otherwise (stored bit ‘0’). Finally, the outputs from the
central tile are ORed by the selection logic in the bottom tile to generate the read result
(Figure 6.20d). The timing diagram of the read operation is shown in Figure 6.21. For
the write operation, in principle, we can apply the same decoding scheme as the read operation
to write data into memory. Nevertheless, due to the die area constraint, we choose to use the
write data path as described in Section 6.4.2 to write data into the memory block in this
test chip.
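The read path of the 16-word example can be sketched behaviorally as follows. This is an illustration under stated assumptions, not the chip's netlist: cells are reduced to booleans (True for LRS), the placement of LRS cells in the column-decoder rows is an assumed layout chosen so that every unselected column discharges, and the function name read_word and the data layout data[row][word] are hypothetical.

```python
def read_word(data, addr, row_bits=2, col_bits=2, word_bits=2):
    """Behavioral read of data[row][word]; returns the selected word as an int."""
    row, col = addr >> col_bits, addr & ((1 << col_bits) - 1)

    # Left tile: row decoder output; only the selected data row gets VDD.
    data_wl_on = [r == row for r in range(1 << row_bits)]

    # Routed column address: two rows per bit carrying A[i] and NOT A[i].
    # As in the compute mode, a logic '0' turns the access transistors on.
    dec_wl_on = []
    for i in range(col_bits):
        a = (col >> i) & 1
        dec_wl_on += [a == 0, a == 1]

    result = 0
    for w in range(1 << col_bits):
        # Decoder cells (assumed layout): LRS where a wrong address bit
        # should discharge this column.
        dec_lrs = []
        for i in range(col_bits):
            w_i = (w >> i) & 1
            dec_lrs += [w_i == 1, w_i == 0]
        for b in range(word_bits):
            # Data cells: bit '1' is stored as HRS, bit '0' as LRS.
            data_lrs = [((word_row[w] >> b) & 1) == 0 for word_row in data]
            wl = dec_wl_on + data_wl_on
            lrs = dec_lrs + data_lrs
            sa = 0 if any(on and cell for on, cell in zip(wl, lrs)) else 1
            result |= sa << b  # bottom tile: OR across all columns of bit b
    return result


# 4 rows x 4 words of 2 bits each (16 2-bit words), arbitrary contents.
memory = [[(3 * r + w) % 4 for w in range(4)] for r in range(4)]
print([read_word(memory, addr) for addr in range(16)])
```

Because every unselected column is forced to ‘0’, a plain OR in the bottom tile is enough to pass the selected word through.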

Figure 6.20: (a) Three adjacent tiles are used to implement a memory block that stores
16 2-bit words. (b) The left tile implements the row address decoder and routes column
address to the central tile. (c) The central tile implements the column address decoder and
stores data. (d) The bottom tile implements the selection logic to generate the read result.

Figure 6.21: The timing diagram of the read operation.
[Signals shown: CLK; row address A[3], A[2]; column address A[1], A[0]; address decoded by the left tile (S0 = A[3]+A[2], S1 = A[1]); SA output of the central tile (S2); final read results generated by the bottom tile (S3, S4).]

We made several design decisions in this Memory mode. First, we choose to use
the left/bottom tiles to implement peripheral circuits (e.g. address decoders) rather than
using dedicated CMOS-based circuits for two reasons. i) It allows us to flexibly select the
number of row/column address bits to implement memory blocks, as compared with hard-wired CMOS decoding circuits which only support fixed row/column address bits. ii)
CMOS-based peripheral circuits are only useful for this Memory mode and would be wasted
when a tile operates in the other three modes.
Secondly, with such flexibility, we need to judiciously select the number of row/column
address bits to maximize the achievable memory capacity. Figure 6.22a shows an example
in which the optimal address selection achieves the memory capacity of 16 2-bit words, as
compared with a non-optimal one which can only store 12 2-bit words given the same area
(Figure 6.22b). Since the number of possible selections is limited (< 8), the compilation
framework can examine all the cases to find the optimal one. It is also worth noting
that the optimal address selection varies with the word size. For instance, the optimal
selection of a memory block with a 4-bit word size (3-bit row address and 1-bit column
address, Figure 6.22c) is different from that of a memory block with a 2-bit word size (2-bit
column/row address, Figure 6.22a).
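The exhaustive search over address splits can be sketched in a few lines. The capacity model below is an assumption inferred from the text and the numbers of Figure 6.22, not the actual compilation framework: an N-bit column address occupies 2 × N rows of the central tile, each remaining row holds 2^N words, and an 8 × 8 central tile is assumed because it reproduces the capacities quoted for the example. The function name best_address_split is hypothetical.

```python
def best_address_split(rows, cols, word_bits):
    """Enumerate column-address widths and return the highest-capacity split.

    Returns (row_address_bits, column_address_bits, words).
    """
    best = None
    n_col = 0
    # A split is feasible while the words of one row fit into the columns
    # and at least one row is left for data.
    while (1 << n_col) * word_bits <= cols and rows - 2 * n_col > 0:
        data_rows = rows - 2 * n_col          # rows left after the column decoder
        words = data_rows * (1 << n_col)      # data rows times words per row
        n_row = (data_rows - 1).bit_length()  # bits needed to address the data rows
        if best is None or words > best[2]:
            best = (n_row, n_col, words)
        n_col += 1
    return best


print(best_address_split(8, 8, word_bits=2))
print(best_address_split(8, 8, word_bits=4))
```

Under these assumptions the search returns a 2-bit row and 2-bit column address (16 words) for 2-bit words, and a 3-bit row and 1-bit column address (12 words) for 4-bit words.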

Figure 6.22: (a) An optimal selection of the row/column address bits leads to 32 bits (16
2-bit words) of memory capacity, as compared with (b) a non-optimal one which leads to
24 bits (12 2-bit words) of memory capacity given the same area. (c) The optimal address
selection for a memory block with a 4-bit word size, which achieves 48 bits (12 4-bit words)
of memory capacity. The left/bottom tiles are not drawn in this figure for simplicity.
[Panels: (a) 2-bit row address, 2-bit column address, 16 2-bit words; (b) 3-bit row address, 1-bit column address, 12 2-bit words; (c) 3-bit row address, 1-bit column address, 12 4-bit words.]

Figure 6.23: One tile can be configured to perform parallel search operations.
[Example shown: search input S[1:0] = 01; BL0 outputs ‘1’ (match) and BL1 outputs ‘0’ (mismatch).]

Finally, we optimize the column address decoder to maximize the resource utilization. As shown in Figure 6.20c, the column address decoder occupies
multiple rows in the central tile, which reduces the number of rows that can be used for data
storage. Thus, it is desirable to minimize the number of rows used for column decoding.
We therefore route the column address to the central tile and decode it there,
which only occupies 2 × N rows for an N-bit address. In contrast, if the column address is
decoded in the left tile (same as the row address), the decoded select lines will occupy 2^N rows for an
N-bit address, resulting in a significantly reduced memory capacity.


Light-weight Compute Mode
One tile can implement parallel search operations for pattern matching (Figure 6.23).
This mode is critical for supporting big data applications that are often search-intensive.
In this mode, the search keys are applied on WLs and sensing operation is performed to
generate search results by comparing the data entries stored in the memory array with the
incoming search keys. The SA outputs ‘1’ to indicate a match and ‘0’ to indicate a mismatch. To support
search, the data encoding in this mode is very different from that in the Heavy-weight Compute
and Memory modes. In this mode, two adjacent WL inputs and two
adjacent memory cells are logically grouped to encode a single bit. As shown in Figure 6.23, one bit
of the search key is encoded into complementary voltage levels applied on the pair of WLs
(e.g. bit ‘0’ is encoded as applying GND on the top WL and VDD on the bottom WL). One
data entry is stored in one column. The RRAM elements in a pair of cells are programmed
into complementary states to store bit ‘0’ or ‘1’, or are both programmed into HRS to store a
“don’t care” state (denoted by an ‘X’). As shown in Figure 6.23, if there is a mismatch
between the stored entry and the search key, the BL (BL1) has a large discharging current,
which is detected by the SA to output ‘0’ (mismatch), otherwise, it outputs ‘1’ (match).
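The search encoding can be modeled as a small ternary match. This is a minimal sketch with two assumptions beyond the text: key bit ‘1’ is taken to be the complement of the stated encoding for bit ‘0’ (VDD on the top WL, GND on the bottom WL), and the assignment of LRS/HRS to the top and bottom cells for stored ‘0’ and ‘1’ is chosen so that equal bits create no pull-down path. The helper names are hypothetical.

```python
def encode_key(key_bits):
    """Per key bit: (top WL on, bottom WL on). Bit '0' -> top GND, bottom VDD."""
    wl = []
    for bit in key_bits:
        wl += [bit == 1, bit == 0]
    return wl


# Per stored symbol: (top cell is LRS, bottom cell is LRS); assumed polarity.
CELL_PAIR = {
    "0": [True, False],
    "1": [False, True],
    "X": [False, False],  # don't care: both cells in HRS, never discharges
}


def encode_entry(entry):
    """Expand a string over {'0', '1', 'X'} into one column of cell states."""
    cells = []
    for symbol in entry:
        cells += CELL_PAIR[symbol]
    return cells


def search(key_bits, entries):
    """Return one SA output per column: 1 for a match, 0 for a mismatch."""
    wl = encode_key(key_bits)
    results = []
    for entry in entries:
        large_current = any(on and lrs for on, lrs in zip(wl, encode_entry(entry)))
        results.append(0 if large_current else 1)
    return results


print(search([0, 1], ["01", "X1", "11", "0X", "10"]))
```

All columns are evaluated by the same sensing operation, which is what makes the search parallel across stored entries.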
One tile can also implement binarized neural networks (BNNs). The input vector is
applied on the WLs, and the sensing operation is applied to perform the BNN operations,
i.e., bit-wise XNOR, population count and activation. The data encoding in this mode is
similar to that in Search. We logically group two adjacent WL inputs and two adjacent
memory cells to encode a single bit of value. As shown in Figure 6.24, applying complementary voltage values (VDD and GND) on adjacent WLs is used to encode input ‘0’ and ‘1’.
Each weight vector is stored in one column and two adjacent 1T1R cells are programmed
into complementary states to store a single-bit weight value. The XNOR count and activation operations are implemented by the SAs in the connection nodes. During sensing,
the BLs are first precharged and the input vector is applied on the WLs. Each BL then
performs a bitwise XNOR and discharges at a rate that represents the XNOR count
result, which is detected by the SA to generate the output. Specifically, the SA will output
‘0’ if the discharging rate exceeds a given threshold; otherwise, it will output ‘1’ (equivalent
to performing the activation function). This threshold can be adjusted by tuning the reference
BL timing to implement different activation functions. More details on tuning the reference
timing can be found in the description of the sensing operation in Section 6.4.1.

Figure 6.24: One tile can be configured to implement binarized neural networks.
[Example shown: input In[1:0] = 01; the slope of the BL voltage represents the XNOR count result, and the reference timing is adjusted to implement the activation function.]
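The BNN operation of one column can be sketched with the same pairwise encoding. This is a simplified model under stated assumptions: each mismatching input/weight pair is assumed to enable exactly one LRS pull-down path, HRS leakage is ignored, the discharge rate is taken to be proportional to the number of such paths, and the reference timing is abstracted into a hypothetical integer threshold min_xnor_count.

```python
def wl_pair(bit):
    """(top WL on, bottom WL on) for one input bit; assumed polarity."""
    return [bit == 1, bit == 0]


def cell_pair(weight):
    """(top cell is LRS, bottom cell is LRS) for one weight bit; assumed polarity."""
    return [weight == 0, weight == 1]


def bnn_column(inputs, weights, min_xnor_count):
    """SA output of one BL: 1 if the XNOR count reaches the threshold, else 0."""
    pull_down_paths = 0
    for x, w in zip(inputs, weights):
        # A mismatching pair (x != w) enables one LRS path; a matching pair none.
        pull_down_paths += sum(
            on and lrs for on, lrs in zip(wl_pair(x), cell_pair(w))
        )
    xnor_count = len(inputs) - pull_down_paths
    # Fewer pull-down paths means slower discharge, which the SA reads as '1'.
    return 1 if xnor_count >= min_xnor_count else 0


print(bnn_column([1, 0, 1, 1], [1, 0, 0, 1], min_xnor_count=3))
print(bnn_column([1, 0, 1, 1], [1, 0, 0, 1], min_xnor_count=4))
```

Changing min_xnor_count plays the role of retuning the reference BL timing: the same stored weights then realize a different activation threshold.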

6.4.2 Write Operation

The write data path to program a selected 1T1R cell is shown in Figure 6.25a. The
BL/SL/WL are connected to the HV-drivers, which contain level-shifters and multiplexers to connect WL/BL/SL to different voltage supplies (0.8V, 2V, 2.5V, 3.2V, 4V) based on
the value stored in the registers. To set a selected RRAM element into LRS (Figure 6.25b),
2V is applied on the WL of the selected row by loading ‘01’ into the corresponding WL_Sel
register, while the unselected WLs are connected to ground by loading ‘00’ into their
WL_Sel registers. For the selected column, ‘1’ is loaded into the Bit Mask register and ‘00’ is loaded
into the Data register. Then the HV-drivers apply a 2V voltage pulse (100ns) on the SL and
connect the BL to ground for this selected column. A current flows through the selected
RRAM element to set it into LRS. For the unselected columns, ‘0’ is loaded in the Bit Mask,
and both BL and SL are connected to ground. No current flows through the unselected
1T1R cells and their resistance states remain unchanged, since 1) the access transistor is
turned off for the unselected rows, and/or 2) there is no voltage drop across BL and SL for the unselected columns. To reset a selected RRAM element into HRS (Figure 6.25c), ‘10’ is loaded
in the WL_Sel register to apply 3.2V on the WL of the selected row. The bits stored in the
Bit Mask registers are the same as those in the set operation, but ‘01’ is loaded into the Data
register for the selected column. Then the HV-drivers apply a 2.5V voltage pulse (100ns)
on the BL and connect SL to ground. The current flows through the selected RRAM in the
opposite direction (as compared with the set operation) to reset it into HRS. The setup for
forming a selected RRAM element is similar to that of the set operation, but it loads ‘11’
into the WL_Sel register of the selected row and ‘10’ into the Data register of the selected
column. This applies 0.8V on the corresponding WL and a 4V voltage pulse (40µs) on the
corresponding SL. Output voltages of the HV-drivers are summarized in Figure 6.25d.

Figure 6.25: (a) The write data path for programming a selected 1T1R cell is drawn in
figure. The conceptual diagram illustrates the operation to (b) set the selected RRAM into
LRS, and (c) reset the selected RRAM into HRS. (d) The output voltages of HV-drivers
are summarized in the table.

WL_Sel   WL
00       0V
01       2V
10       3.2V
11       0.8V

Bit Mask   Data   BL (V)   SL (V)
1          00     0        2
1          01     2.5      0
1          10     0        4
0          -      0        0
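The register-to-voltage mapping described above can be captured as a lookup table. The sketch below only restates the values given in the text; the dictionary and function names are hypothetical, and grounding the BL during forming is an assumption based on the statement that forming is set up like the set operation.

```python
# WL_Sel register value -> WL voltage (V)
WL_SEL_TO_WL = {"00": 0.0, "01": 2.0, "10": 3.2, "11": 0.8}

# Data register value -> (BL, SL) voltage (V) for a column whose Bit Mask is 1
DATA_TO_BL_SL = {
    "00": (0.0, 2.0),  # set: pulse on SL, BL grounded
    "01": (2.5, 0.0),  # reset: pulse on BL, SL grounded
    "10": (0.0, 4.0),  # forming: pulse on SL, BL assumed grounded
}

# Operation -> (WL_Sel, Data, pulse width in seconds)
OPERATIONS = {
    "set": ("01", "00", 100e-9),
    "reset": ("10", "01", 100e-9),
    "form": ("11", "10", 40e-6),
}


def column_drive(bit_mask, data):
    """(BL, SL) voltages of one column; unselected columns are grounded."""
    if bit_mask == 0:
        return (0.0, 0.0)
    return DATA_TO_BL_SL[data]


def program_cell(operation):
    """Voltages and pulse width seen by the selected 1T1R cell."""
    wl_sel, data, pulse_s = OPERATIONS[operation]
    bl, sl = column_drive(1, data)
    return {"wl": WL_SEL_TO_WL[wl_sel], "bl": bl, "sl": sl, "pulse_s": pulse_s}


for name in OPERATIONS:
    print(name, program_cell(name))
```

Unselected rows use WL_Sel ‘00’ (access transistor off) and unselected columns use Bit Mask 0 (no voltage across BL and SL), which are the two conditions that keep other cells undisturbed.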

Figure 6.26: (a) The distributions of the effective resistance for both match and 1-bit
mismatch cases. (b) The maximum operating frequency, power consumption, and array
efficiency under different array sizes. (c) The power efficiency and area efficiency for machine
learning and big data applications under different array sizes.
[Panels: (a) frequency versus resistance (kΩ) for the match and 1-bit mismatch cases, for 32 × 32, 64 × 64, 128 × 128 and 256 × 256 arrays; (b) frequency (MHz), power per tile (mW) and array efficiency (%) versus array size; (c) power efficiency (TOPS/W, Tb-Search/s/W) and area efficiency (TOPS/mm2, Tb-Search/s/mm2) versus array size.]
Match: The input search key is identical to the stored data entry. 1-bit Mismatch: Only one bit in the search key is different from the data entry. Tb-Search/s: The product of the search key width and search throughput.

6.4.3 Discussion

To explore the design trade-offs between speed, power and area efficiency, we evaluate
the performance of Liquid Silicon under different array sizes by varying the number of
rows/columns. We first study how the maximum operating frequency, power consumption,
and array efficiency of one tile vary with different array sizes through simulation. As shown
in Figure 6.26b, the maximum operating frequency decreases when increasing the array size.
The reason is that a larger array degrades the sensing margin (Figure 6.26a), and thus the SA needs
a longer discharging time to generate the correct output. We also observe that the power
consumption increases in a larger array (Figure 6.26b), as it has more discharging paths
and a longer discharging time, which increase the voltage swing on the BL. Nevertheless, a
larger memory array has a higher array efficiency (Figure 6.26b).
To determine the optimal array size, we further evaluate the power efficiency (OPS/W)
and area efficiency (OPS/mm2) for mapping machine learning and big data applications
onto the tiles. As shown in Figure 6.26c, the power efficiency increases with a larger array
size, as increasing array size can improve the algorithmic mapping efficiency, i.e., performing
more effective computations (e.g. bitwise XNOR operations in the Neural Network mode) in
a single sensing operation. This improvement compensates for the degradation in the operating
frequency and power consumption, thereby leading to a higher power efficiency. However,
when increasing the array size from 128 to 256, we see a diminishing return in the power
efficiency, as the degradation in frequency and power largely offsets the improvement in
the algorithmic mapping. We observe a similar trend for the area efficiency (Figure 6.26c).
When the array size is less than 256, the improvement in array efficiency can compensate for the
degradation in the operating frequency, thereby leading to a higher area efficiency. However,
increasing the array size beyond 256 does not improve the area efficiency any further. To
account for these results and the die area constraint, we choose the array size of 128 × 128
in our implementation.

6.5 Extend Virtualization Solution

Instead of naively applying the proposed virtualization solution to the Liquid Silicon architecture, we can co-optimize the virtualization solution and the Liquid Silicon architecture
to maximize the performance. The key observation is that while using tiles for routing can
enable a flexible resource partition (Figure 6.3) and is efficient for short-distance interconnections, this routing solution is not efficient for long-distance interconnections that need
to be routed through several tiles. Based on our quantitative study, the routing consumes
about 42% of the total area and contributes to about 61% of the total delay in large applications that have a large number of long-distance interconnections. In contrast, the
routing consumes less than 12% of the total area and contributes to less than 19%
of the total delay in the same applications that only have short-distance interconnections.
The two-level system abstraction is modified to address this limitation.
As illustrated in Figure 6.27, the low-level abstraction is modified into a 2D array of LL
virtual blocks. One LL virtual block is mapped into a cluster of tiles, while the synchronous
interface for the communication between LL virtual blocks is implemented by an FPGA-like
routing fabric (Figure 6.28). This hybrid routing fabric effectively addresses the aforementioned limitation. Specifically, the FPGA-like segment-based routing fabric efficiently
implements the long-distance global interconnections, while the tiles efficiently implement the short-distance local interconnections. The clustering of tiles effectively reduces
the number of long-distance global interconnections, thereby avoiding the routing overhead
in conventional FPGA architecture. The compilation framework for this modified system
abstraction is drawn in Figure 6.29. Compared with the original compilation framework
(Figure 4.1), the key modifications are (1) the partition tool developed for the single-level
system abstraction is utilized to partition one HL virtual block into a 2D array of LL virtual
blocks, (2) the custom tool developed for Liquid Silicon is used to map one LL virtual block
into a cluster of tiles, (3) the routing tool in VTR is reused to route the interconnections
between clusters of tiles, with an architecture file that defines a new type of block to represent
the cluster of tiles provided so that VTR’s routing tool can be reused, and (4) no additional custom
tool is needed for relocation, as the results generated by the Liquid Silicon Place&Route tool are
already relocatable. While the 2D array of LL virtual blocks can have an arbitrary width,
we choose to restrict it to have a fixed width to reduce the compilation cost and simplify
the runtime management.

Figure 6.27: A conceptual diagram illustrates low-level abstraction modified for Liquid
Silicon.
[Labels shown: low-level virtual blocks, latency-insensitive interface, synchronous interface, interface to peripherals.]

Figure 6.28: A conceptual diagram illustrates the hybrid routing fabric in the modified
Liquid Silicon. The cluster contains 2 × 2 tiles in the example.
[Labels shown: cluster, tile, S/A, DFF, local routing, global routing, connection block, switch block, routing channel.]

6.6 Results

This section first presents the simulation results that compare the Liquid Silicon architecture with the FPGA architecture, and then provides the chip measurement results. Finally,
the efficiency of the virtualization solution on the Liquid Silicon architecture is evaluated.

6.6.1 Evaluation Setup

Benchmark Selection.
Several factors have been taken into consideration when selecting the benchmarks. 1)
They should be representative and diverse enough to cover a range of workload
characteristics in the form of compute-to-memory access ratio [159], ranging from compute-intensive applications to data-/search-intensive applications. 2) They should account for
the demands of potential new applications which may not be amenable to FPGA-like
acceleration but can better exploit the flexible resources of Liquid Silicon.

Figure 6.29: The compilation framework for the virtualized Liquid Silicon.
[Blocks shown: Applications (TensorFlow, OpenCL …), High-Level Synthesis, Verilog RTL, High-Level Partition, Latency-Insensitive Interface Generation, Custom Interface Description, Parser, Low-Level Partition, Technology Mapping, Architecture File, Local Place&Route (Liquid Silicon Tool), Global Route (VTR), Bitstream.]

Based on these factors, three sets of benchmarks are used: 1) traditional FPGA benchmarks, 2) search-intensive benchmarks and 3) binarized neural network benchmarks. The
traditional FPGA benchmarks contain benchmarks from the MCNC suite [145], which have
been widely used by the FPGA community to evaluate reconfigurable architectures [98].
We note that the benchmarks in this evaluation are originally developed for FPGA under
the constraint of limited on-chip memory support and thus are compute-intensive applications with high compute-to-memory access ratio. The search-intensive and binarized neural
network benchmarks, on the other hand, are representative of emerging applications with
low compute-to-memory access ratio and are mainly used to evaluate the unique flexibilities
of Liquid Silicon in supporting light-weight computation. In the following discussion, we
provide more details about the search-intensive and binarized neural network benchmarks.
The search-intensive benchmark set contains four representative workloads obtained
from a diverse set of application domains. In these benchmarks, most of the runtime/energy
is spent on the search operation, therefore they can be used to evaluate the light-weight
compute mode of Liquid Silicon. In addition, these benchmarks require different post-match processing (e.g. priority encoding and population count) and have different search
key widths, thereby covering the different use cases. More details are presented in Table 6.1.

Table 6.1: Description of the search-intensive benchmark set.

Benchmark                     Search Key    Post-match          Description
                              Width (bit)   Processing
String Match [108]            80            Bitwise OR          Scan a list of encrypted words for occurrences of a set of keys.
Packet Classification [121]   104           Priority Encoding   Classify every incoming packet by comparing its header fields against a filter set.
Word Count [108]              184           Population Count    Count the occurrences of each unique word in a document.
Similarity Search [114]       288           Priority Encoding   Data mining technique for various applications [88][12][74][110]. A TCAM-based implementation is realized in [114].

The binarized neural network benchmark set contains five binary neural network (BNN)
designs. A BNN stores weights as 1-bit binary numbers (+1/-1), thereby significantly reducing the size of weights, as well as the computational complexity (perform bit-wise XNOR
instead of floating-point multiplication in Convolutional Neural Networks or CNNs) [28].
We note that the primary reason for choosing BNNs over CNNs in our evaluation is their
simplicity and efficiency when implemented in hardware. Although the precision of weights
is reduced, BNNs still provide comparable classification accuracy compared to CNNs on
several datasets [28][27][124][69]. Due to these advantages, there has been a growing effort devoted to implementing BNN on different platforms, e.g., CPU, GPU, TrueNorth and
FPGA [124][163][99][28][27].
Among the five BNN benchmarks, two are binarized convolutional neural networks
(BNN1 and BNN2), while the rest are multilayer perceptrons (BNN3, BNN4 and BNN5)
with three hidden layers, as listed in Table 6.2. Their reported error rates are comparable
to their non-binarized counterparts [124][28]. In these BNN benchmarks, the key processing
units (perform XNOR, count and normalize operations) are highly pipelined and optimized
for performance. Additionally, we assume the training is done offline and only evaluate the
performance of inference, as most prior works did.

Table 6.2: Topology for BNN benchmarks.

Binarized CNN   Topology Description
BNN1            3x3 conv - 3x3 conv - 2x2 pool (output depth 64)
                3x3 conv - 3x3 conv - 2x2 pool (output depth 128)
                3x3 conv - 3x3 conv - 2x2 pool (output depth 256)
                two FC layers with 512 neurons
BNN2            3x3 conv - 3x3 conv - 2x2 pool (output depth 128)
                3x3 conv - 3x3 conv - 2x2 pool (output depth 256)
                3x3 conv - 3x3 conv - 2x2 pool (output depth 512)
                two FC layers with 1024 neurons

Binarized MLP   Neurons in each layer
BNN3            784-256-256-256-10
BNN4            784-1024-1024-1024-10
BNN5            784-2048-2048-2048-10

conv: convolution layer. pool: max pooling layer. FC: fully connected layer.

Baseline.
FPGA is used as a baseline since it is the most popular commercially available reconfigurable architecture. More importantly, Liquid Silicon shares some similarities to FPGAs in
its morphable data-flow architecture, despite a number of radical differences (Section 6.2.3).
Moreover, the goal of this evaluation is to provide insights in comparing two architectures (Liquid Silicon, FPGA) paired with two technologies (RRAM, SRAM). To ensure the
benefits that we gained in Liquid Silicon are not simply due to the advance in technology,
we also include two more cases (RRAM-based FPGAs and SRAM-based Liquid Silicon) in
our evaluation. We note that commercial off-the-shelf FPGAs are SRAM-based FPGAs
and are therefore chosen as our baseline. The RRAM-based FPGA is a drop-in replacement for
an SRAM-based FPGA while maintaining the same architecture. Similarly, in the SRAM-based Liquid Silicon, the RRAMs are replaced with SRAMs. In the rest of the discussion,
we also refer to Liquid Silicon as RRAM-based Liquid Silicon to distinguish it from SRAM-based Liquid Silicon.
Simulation Setup.
For Liquid Silicon, 1) the applications are mapped using the custom compilation framework (Section 6.3), 2) the delay and power consumption are obtained from the HSPICE simulation (45nm PTM HP model [8]), and 3) the area is measured based on our custom physical design (Section 6.2.4-Physical Design). More specifically, in the HSPICE simulation,
two Verilog-A modules are created to simulate the behavior of the TaOx RRAM device [132]
and the diode [62]. The characteristics of RRAM devices are (1) Ron/Roff = 5kΩ/100kΩ
and (2) 1.8V@100ns/-1V@100ns pulse for SET/RESET [133]. The turn-on voltage of the
diode is 0.4V [62]. In the physical design, the size of the crossbar array is chosen to be
256 × 256. For FPGA, the mapping of benchmarks is generated by the VTR tool set. The
area, delay and power are estimated based on the models of SRAM-based FPGAs provided
by VTR. The architecture file k6_frac_N10_frac_chain_mem32K_40nm.xml is used.

6.6.2 Traditional FPGA Benchmarks

In this evaluation, the traditional FPGA benchmark set is applied to evaluate the performance
of Liquid Silicon. The SRAM-based FPGA architecture is chosen as the baseline,
and the results of the other architectures are normalized to it.
Area: RRAM-based Liquid Silicon achieves 81% area savings compared to SRAM-based
FPGAs (Figure 6.30). We also observe that it consumes 31% less area than RRAM-based
FPGAs, which implies that this area reduction is not simply due to a drop-in replacement
of SRAMs with dense RRAMs. It is also worth noting that among the four
architectures, SRAM-based Liquid Silicon has the largest area cost, indicating that Liquid
Silicon is better suited to the RRAM technology. Our evaluation confirmed that
the compilation framework can effectively utilize tiles, and the average utilization of tiles is
above 70%.
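As a quick consistency check on the two area figures above, the following sketch derives the normalized area they imply for the RRAM-based FPGA. It assumes that both percentages are averages over the same benchmark set and can be composed multiplicatively, which the text does not state; the function and variable names are illustrative only.

```python
# Normalized areas implied by the quoted savings (SRAM-based FPGA baseline = 1.0).
# Assumption: the 81% and 31% savings describe the same averaged quantity.

def implied_rram_fpga_area(ls_saving_vs_sram_fpga: float,
                           ls_saving_vs_rram_fpga: float) -> float:
    """Return the RRAM-based FPGA area normalized to the SRAM-based FPGA."""
    ls_area = 1.0 - ls_saving_vs_sram_fpga           # RRAM-based Liquid Silicon
    return ls_area / (1.0 - ls_saving_vs_rram_fpga)  # undo the 31% saving

ls_area = 1.0 - 0.81
rram_fpga_area = implied_rram_fpga_area(0.81, 0.31)
print(f"RRAM-based Liquid Silicon: {ls_area:.2f} of baseline area")
print(f"RRAM-based FPGA (implied): {rram_fpga_area:.3f} of baseline area")
```

Under these assumptions, swapping SRAM for RRAM in the FPGA alone would leave roughly a quarter of the baseline area, and the remaining gap to Liquid Silicon would be attributable to the architecture.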
Figure 6.30: From top to bottom: area, delay, energy efficiency (energy-delay product,
EDP) and routing usage results. Results of the SRAM-based FPGA are used as the baseline,
and the other results are normalized to them. The routing usage is the ratio between routing
area and total used area when mapping benchmark circuits. In Liquid Silicon, it is obtained
by first calculating the ratio between routing area and total used area (routing+logic) of
each tile and then averaging across all tiles.
[Figure 6.30 plots omitted. Panels: Area (%), Delay (%), EDP (%), Routing Usage (%).
Series: SRAM-based FPGA, RRAM-based FPGA, SRAM-based Liquid Silicon, RRAM-based Liquid
Silicon. Annotations: area saving 81%, delay reduction 52%, EDP improvement 86%; routing
usage 58.36% versus 15.41%.]

Delay: The delay results are presented in Figure 6.30. Averaged over all benchmarks,
RRAM-based Liquid Silicon outperforms the other three architectures. The improvement
in delay mainly comes from the coarse-grained logic implementation of Liquid Silicon, which
has also been confirmed in our experiments. Specifically, the technology mapping stage
generates netlists with much shallower depth, resulting in a reduced number of logic gates
on the critical paths in Liquid Silicon. As a result, the delay of Liquid Silicon is 52% less
than that of SRAM-based FPGAs, on average.
The delays of the SRAM-based Liquid Silicon and RRAM-based FPGA increase by
14% and 18%, respectively, compared to the SRAM-based FPGA. We note that the delay
becomes worse in SRAM-based Liquid Silicon due to the fact that the increase in unit
delay per tile caused by longer wires (larger RC constant) is much more significant than the
decrease in logic depth. The minor increase in delay for the RRAM-based FPGA is because
of the longer sensing time of the RRAM array due to its higher Ron compared to that of a
MOSFET.
Figure 6.31: The area saving (top), throughput improvement (middle) and power reduction
(bottom) are presented. All results are normalized to those of the SRAM-based FPGA. The
area result is plotted on a logarithmic scale.
[Figure 6.31 plots omitted. Panels: Area (%), Throughput (%), Power (%). Series:
SRAM-based FPGA, RRAM-based FPGA, SRAM-based Liquid Silicon, RRAM-based Liquid Silicon.
Annotation on the throughput panel: increase with wider key.]

Another advantage of Liquid Silicon is that it makes better use of the routing resources
than FPGAs, due to its coarse-grained logic implementation and flexible resource
partitioning between logic and routing. Overall, the area consumed by routing in Liquid
Silicon is only 15% of the total used area, compared to 58% in the SRAM-based FPGA
(Figure 6.30).
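The routing usage metric defined in the caption of Figure 6.30 (per-tile ratio of routing area to used area, averaged across tiles) can be sketched as follows. The `TileUsage` type, the choice to skip tiles with no used area, and the sample values are assumptions made for illustration; they are not taken from the evaluation.

```python
from dataclasses import dataclass

@dataclass
class TileUsage:
    routing_area: float  # area of the tile used for routing
    logic_area: float    # area of the tile used for logic

def routing_usage(tiles: list[TileUsage]) -> float:
    """Average of routing / (routing + logic) over tiles that are actually used."""
    ratios = [
        t.routing_area / (t.routing_area + t.logic_area)
        for t in tiles
        if t.routing_area + t.logic_area > 0  # assumption: unused tiles are skipped
    ]
    if not ratios:
        raise ValueError("no used tiles")
    return sum(ratios) / len(ratios)

# Hypothetical tiles, chosen only to exercise the function.
example_tiles = [TileUsage(1.0, 9.0), TileUsage(2.0, 8.0), TileUsage(0.0, 0.0)]
print(f"routing usage: {routing_usage(example_tiles):.2%}")
```

Averaging per-tile ratios weights every used tile equally, which differs from dividing the total routing area by the total used area when tiles are unevenly filled.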
Energy Efficiency: The energy efficiency results are presented in Figure 6.30. On
average, RRAM-based Liquid Silicon achieves an 86% improvement in energy efficiency (EDP)
compared to the SRAM-based FPGA. This improvement is mainly due to the coarse-grained
logic implementation of Liquid Silicon. Not only does the coarse-grained logic implementation
lead to a smaller delay, but it also uses fewer hardware resources for routing
and thus reduces the energy spent on data transfer.
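Because EDP is the product of energy and delay, the reported averages can be used to estimate how much of the EDP gain comes from energy alone. The sketch below is a rough decomposition that assumes the averaged EDP and delay ratios compose multiplicatively, which need not hold exactly for per-benchmark averages; the function name is illustrative.

```python
# EDP = energy * delay, so energy_ratio = edp_ratio / delay_ratio (all relative to the baseline).

def implied_energy_ratio(edp_improvement: float, delay_reduction: float) -> float:
    """Energy relative to the baseline implied by an EDP improvement and a delay reduction."""
    edp_ratio = 1.0 - edp_improvement
    delay_ratio = 1.0 - delay_reduction
    return edp_ratio / delay_ratio

energy_ratio = implied_energy_ratio(edp_improvement=0.86, delay_reduction=0.52)
print(f"implied energy relative to baseline: {energy_ratio:.2f}")
```

Under this assumption, the 86% EDP improvement and the 52% delay reduction together suggest that energy alone drops to roughly 0.3 of the baseline.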

6.6.3 Search-intensive Applications

In this subsection, the search-intensive benchmark set is applied to evaluate Liquid Silicon.
All results are normalized to the SRAM-based FPGA baseline. Overall, across all
search-intensive workloads, the area, throughput and power improvements of Liquid Silicon
over the FPGA baseline and the other two architectures (RRAM-based FPGA and SRAM-based
Liquid Silicon) are substantially larger than those observed for the traditional FPGA
benchmarks in Section 6.6.2. Thus, we confirm the effectiveness of the new light-weight
compute mode of Liquid Silicon in accelerating search-intensive applications. In the
following discussion, we present the detailed evaluation results
for area, throughput and power for these benchmarks.
Area: On the four evaluated benchmarks, Liquid Silicon achieves a 99.6% average area
reduction compared with the FPGA baseline (SRAM-based FPGA) (Figure 6.31-top). It is
also interesting to observe that the SRAM-based Liquid Silicon consumes 44.5% less
area than the RRAM-based FPGA, indicating that the benefits gained from the light-weight
compute mode outweigh the area overhead imposed by the larger cell size of SRAM.
Throughput: On average, the RRAM-based Liquid Silicon achieves 1.43× the search
throughput of the FPGA baseline. Across the benchmarks, we observe a
trend that the improvement in search throughput increases with a wider search key. This is
mainly because, in FPGA, a wide search key needs to be split into multiple smaller ones, each
fed to a local search unit that is implemented using a large amount of logic and BRAM
resources [61]. The search outputs from these local search units need to be collected globally
and processed to generate a final result. Such a distributed implementation consumes more
routing resources, thereby increasing the search delay and reducing the throughput. In
contrast, Liquid Silicon is capable of processing wide search keys locally and directly
by coalescing adjacent tiles configured in the light-weight compute mode. Therefore, the
search delay and throughput of Liquid Silicon are insensitive to the search key width.
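The difference between the two search styles described above can be illustrated with a small software model. This is only a functional sketch with exact-match semantics: it does not model timing, routing, or the hardware search units, and the function names and sample table are hypothetical.

```python
def split_key(key: str, chunk_bits: int) -> list[str]:
    """Split a wide binary key into fixed-width chunks, one per local search unit."""
    return [key[i:i + chunk_bits] for i in range(0, len(key), chunk_bits)]

def distributed_search(table: list[str], key: str, chunk_bits: int) -> list[int]:
    """FPGA-style search: each local unit matches one chunk, results are merged globally."""
    matches = set(range(len(table)))
    for idx, chunk in enumerate(split_key(key, chunk_bits)):
        lo = idx * chunk_bits
        local = {row for row, entry in enumerate(table)
                 if entry[lo:lo + chunk_bits] == chunk}
        matches &= local  # global collection step that costs routing in hardware
    return sorted(matches)

def wide_search(table: list[str], key: str) -> list[int]:
    """Coalesced search: the whole key is compared against each entry in one step."""
    return [row for row, entry in enumerate(table) if entry == key]

# Hypothetical 8-bit entries.
search_table = ["10110010", "10111100", "00110010"]
search_key = "10110010"
print(distributed_search(search_table, search_key, chunk_bits=4))
print(wide_search(search_table, search_key))
```

Both functions return the same rows; the point of the sketch is that the distributed version needs one extra merge per chunk, so its cost grows with the key width, while the coalesced version does not have that step.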
Power: For all benchmarks, Liquid Silicon outperforms the other three architectures in
power consumption, and achieves a 94.0% average power reduction compared to the FPGA
baseline. The power saving is mainly due to the more efficient mapping of search operations
on Liquid Silicon, which dominate the power consumption in these applications.

6.6.4 Neural Network Benchmarks

Figure 6.32: The runtime speedup (top), energy consumption (middle) and area (bottom)
results are presented. All results are normalized to those of the SRAM-based FPGA.
[Figure 6.32 plots omitted. Panels: Runtime Speedup, Normalized Energy (logarithmic
scale), Normalized Area (%). Benchmarks: BNN1, BNN2, BNN3, BNN4, BNN5 and Geomean.
Series: SRAM-based FPGA, RRAM-based FPGA, SRAM-based Liquid Silicon, RRAM-based Liquid
Silicon.]

In this subsection, we further evaluate the effectiveness of Liquid Silicon in accelerating
neural network workloads (specifically BNNs), which are data-intensive. The evaluation
results for FPGAs and Liquid Silicon on BNNs are presented in Figure 6.32. On average,
Liquid Silicon achieves a 52.3× speedup, a 113.9× reduction in energy consumption,
and an 81% area reduction compared with the FPGA baseline (SRAM-based FPGA). This
improvement mainly comes from two sources. First, in Liquid Silicon, the weights are stored
and processed (bit-wise XNOR) in situ inside the same RRAM crossbar, which eliminates
the frequent memory accesses needed to fetch key neural network parameters (e.g., weights)
in FPGA. Second, the count, normalization and activation operations are
all performed in connection nodes using simple analog circuits, i.e., S/A (as discussed in
Section 6.2.2), whereas FPGA needs complex logic, i.e., adders and comparators, to perform
these operations, resulting in larger delay and energy consumption. Moreover, as the
on-chip memory capacity is fixed in FPGA but can be flexibly configured by users in Liquid
Silicon, Liquid Silicon is expected to achieve an even larger improvement in performance and
energy efficiency over FPGA once the size of the neural network exceeds the capacity
of the FPGA's on-chip memory.
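The XNOR, count, and activation steps mentioned above correspond to the standard formulation of a binarized neuron, which can be sketched in software as follows. This is a functional model only: it assumes that normalization is folded into a per-neuron threshold, a common simplification in BNN inference that the text does not confirm for Liquid Silicon, and the function name and sample values are illustrative.

```python
def xnor_popcount_neuron(inputs: list[int], weights: list[int], threshold: int) -> int:
    """Binarized neuron: bits encode +1 (as 1) and -1 (as 0).

    The output is 1 ("active") when the number of agreeing input/weight
    pairs reaches the threshold, and 0 otherwise.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # XNOR of two bits is 1 exactly when the bits are equal; summing gives the count.
    count = sum(1 for x, w in zip(inputs, weights) if x == w)
    return 1 if count >= threshold else 0

# Hypothetical 4-input neuron.
bnn_inputs = [1, 0, 1, 1]
bnn_weights = [1, 1, 1, 0]
print(xnor_popcount_neuron(bnn_inputs, bnn_weights, threshold=2))
print(xnor_popcount_neuron(bnn_inputs, bnn_weights, threshold=3))
```

In this example two of the four positions agree, so the neuron fires for a threshold of 2 and stays inactive for a threshold of 3.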

6.6.5 Chip Results

The Liquid Silicon test chip is fabricated in a commercial 130-nm CMOS process with
HfO2/Ti/TiN RRAM technology. Figure 6.33 shows the die photo and the major integration
process flow. The OxRAM technology is integrated with Si CMOS by first defining the TiN
bottom electrode on top of the Cu Metal 4, and then depositing an HfO2 10nm/Ti 10nm/TiN
stack after the CMP touch [50]. More details about the fabrication process can be found in [50].
Figure 6.33: (a) Die photo and (b) the integration flow [50].
[Figure 6.33 images omitted. Labels in the die photo: 2091µm × 1663µm, Tile, RRAM Array,
Connection Nodes, Control & Timing Gen, Testing Board. Steps listed in the integration
flow: standard foundry wafer (CMOS 130nm + 4 Cu Metal), TiN bottom electrode definition,
CMP touch, memory stack deposition (HfO2 10nm/Ti 10nm/TiN), Ø300nm mesa patterning,
encapsulation and CMP, via, M5. Cross-section labels: 1T1R Cell, RRAM, Via, Cu M3, Cu M4,
AlCu M5.]

Figure 6.34: (a) The measured resistance distribution under the switching condition: Forming
4V@40µs, SET 2V@100ns, RESET 2.5V@100ns, (b) the measured voltage frequency
scaling, and (c) the measured waveform for logic ‘1’ output (Computation mode: ‘True’,
Storage mode: ‘1’, Search mode: ‘match’, NN mode: ‘active’) and logic ‘0’ output (Computation
mode: ‘False’, Storage mode: ‘0’, Search mode: ‘mismatch’, NN mode: ‘inactive’).
These measurements are conducted at room temperature.
[Figure 6.34 plots omitted. Axes: (a) frequency versus resistance (Ω), (b) frequency (MHz)
versus supply voltage (V). Signals in (c): CLK, BL, Sensing Output.]

Table 6.3: Liquid Silicon Chip Specification

Parameter                  Value
Process Technology         130-nm CMOS + HfO2 RRAM
Cell Structure             1T1R
Cell Size                  1.83 × 4 µm2
Array Size                 128 × 128 bit
Number of Tiles            2
Frequency                  10 MHz
Supply Voltage             0.65∼1.2V
Power Efficiency           60.9 TOPS/W
Area Efficiency            188.4 GOPS/mm2
RRAM Switching Condition   Forming: 4V@40µs; SET: 2V@100ns; RESET: 2.5V@100ns

We first measure the RRAM resistance distribution [50]; the average resistance
ratio is 2500 (Figure 6.34a). We then measure the waveform (Figure 6.34c) to confirm the
sensing operation performed by the connection nodes. The measured maximal operating
frequency under different supply voltages is shown in Figure 6.34b. The results show that
the Liquid Silicon chip can operate reliably when the voltage is scaled from 1.2V down to
0.65V, with a 2.7mW power consumption per tile at the nominal supply voltage of 1.2V. The
chip specification is summarized in Table 6.3.
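The power efficiency in Table 6.3 can be roughly cross-checked against the per-tile power quoted above. The sketch below assumes that every cell of the 128 × 128 array contributes one operation per clock cycle at 10 MHz; this operation count is an assumption of the sketch, not a statement from the measurements, and the names are illustrative.

```python
ARRAY_ROWS = 128        # Table 6.3
ARRAY_COLS = 128        # Table 6.3
FREQ_HZ = 10e6          # Table 6.3
TILE_POWER_W = 2.7e-3   # per-tile power at the nominal 1.2V supply

def power_efficiency_tops_per_w(ops_per_cycle: int, freq_hz: float, power_w: float) -> float:
    """Operations per second divided by power, expressed in TOPS/W."""
    return ops_per_cycle * freq_hz / power_w / 1e12

tile_tops_per_w = power_efficiency_tops_per_w(ARRAY_ROWS * ARRAY_COLS, FREQ_HZ, TILE_POWER_W)
print(f"estimated power efficiency: {tile_tops_per_w:.1f} TOPS/W")
```

The estimate lands close to the 60.9 TOPS/W listed in the table; the small difference may simply reflect rounding of the quoted power.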
Comparison With Prior AI Accelerators
We first compare Liquid Silicon with prior CMOS-based [95][148][68][147][6] and RRAM-based
[116][86] accelerators. These domain-specific accelerators achieve high efficiency at
the cost of limited flexibility, i.e., they can only support machine learning applications.
In contrast, Liquid Silicon is not only more flexible (it supports both machine learning
and big data applications), but also outperforms these accelerators in the machine learning
applications.
In order to quantitatively evaluate the power and area efficiency, we map a fully connected
binarized neural network onto Liquid Silicon and these AI accelerators. The results
are summarized in Table 6.4.

Table 6.4: Comparison with state-of-the-art AI chips

Metric                       Liquid    Moons   Yin     Khwa    Yin     Ando    Su      Liu
                             Silicon   CICC    VLSI    ISSCC   VLSI    VLSI    VLSI    ISSCC
                                       2018    2018    2018    2018    2017    2017    2016
                                       [95]    [148]   [68]    [147]   [6]     [116]   [86]
CMOS Process                 130nm     28nm    28nm    65nm    28nm    65nm    150nm   65nm
RRAM Type                    HfO2      -       -       -       -       -       HfO2    TiN/TiON
Power Efficiency (TOPS/W)    60.95     230     90      111.6   19.9    2.3     0.46    0.03
Area Efficiency (GOPS/mm2)   188.4     232.1   99.1    N/A     33.8    365     2.77    0.02

Overall, Liquid Silicon achieves power efficiency (60.95 TOPS/W) and area efficiency
(188.4 GOPS/mm2) that are better than or comparable to those of these AI
accelerators, even though Liquid Silicon is fabricated in an older-generation CMOS process
technology. Compared with the CMOS-based AI accelerators, Liquid Silicon improves the power
and area efficiency by 3.1× and 5.6×, respectively. This improvement mainly comes from
the better algorithmic mapping in Liquid Silicon, i.e., Liquid Silicon can perform multiple
computations (e.g., XNOR and count) in a single sensing operation, while CMOS-based
accelerators need to implement discrete logic gates with extensive routing to perform these
computations. Liquid Silicon also improves power efficiency (> 132×) and area efficiency
(> 68×) over the RRAM-based accelerators. This is because these accelerators either 1)
simply use RRAM cells as non-volatile storage units and still perform computation using
discrete logic gates, or 2) use RRAM crossbars to perform multiplication in the analog
domain, so that the performance suffers from the power-hungry and area-inefficient multi-bit
ADC/DAC.
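The improvement factors quoted above can be recomputed from the Table 6.4 entries. The sketch below does this per accelerator; the dictionary layout and names are illustrative. One observation from the arithmetic, offered tentatively: the 3.1× and 5.6× factors coincide with the ratios against the Yin [147] entry, and the > 132× and > 68× factors with the ratios against the Su [116] entry.

```python
# (power efficiency in TOPS/W, area efficiency in GOPS/mm2), copied from Table 6.4.
LIQUID_SILICON = (60.95, 188.4)
CMOS_ACCELERATORS = {
    "Moons [95]": (230.0, 232.1),
    "Yin [148]": (90.0, 99.1),
    "Khwa [68]": (111.6, None),  # area efficiency not reported
    "Yin [147]": (19.9, 33.8),
    "Ando [6]": (2.3, 365.0),
}
RRAM_ACCELERATORS = {
    "Su [116]": (0.46, 2.77),
    "Liu [86]": (0.03, 0.02),
}

def improvement(reference: tuple) -> tuple:
    """Return (power ratio, area ratio) of Liquid Silicon over a reference chip."""
    power, area = reference
    power_ratio = LIQUID_SILICON[0] / power
    area_ratio = LIQUID_SILICON[1] / area if area is not None else None
    return power_ratio, area_ratio

for name, entry in {**CMOS_ACCELERATORS, **RRAM_ACCELERATORS}.items():
    p, a = improvement(entry)
    area_text = f"{a:.2f}x" if a is not None else "N/A"
    print(f"{name:12s} power {p:8.2f}x  area {area_text}")
```

A ratio below 1 in the printed list means that the reference chip reports the higher efficiency for that metric, which is consistent with the "better or comparable" wording in the text.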
We provide a back-of-the-envelope calculation to explain why Liquid Silicon, which uses an
older-generation CMOS technology (130nm) augmented with a post-CMOS technology
(i.e., RRAM), can achieve better area efficiency than the CMOS-based accelerators at
an advanced technology node (28nm). To make an apples-to-apples comparison, we choose to
compare Liquid Silicon with the chip reported in [148], since the algorithm (neural network
model) used in both chips is the same. For Liquid Silicon, the tile area is 0.46mm2 with
a 26% array efficiency. In every clock cycle, one column (BL) in the tile performs 128
operations (64 XNOR and 64 addition); thus, the area efficiency of one tile is 360 GOPS/mm2.
The overall area efficiency is 198 GOPS/mm2, since tiles occupy about 55% of the total die
area. For the CMOS-based accelerator, the area efficiency is reported as 99.1 GOPS/mm2.
Thus, Liquid Silicon can achieve 1.99× higher area efficiency than this CMOS-based
accelerator, even though Liquid Silicon is fabricated in an older-generation CMOS technology,
which is consistent with the data reported in Table 6.4. We also note that, if technology
scaling is considered, i.e., if Liquid Silicon is also fabricated in an advanced CMOS technology
node (28nm), it will achieve 42× higher area efficiency than the CMOS-based accelerator
in [148].
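The back-of-the-envelope calculation can be reproduced approximately with the sketch below. Two inputs are assumptions of this sketch rather than statements from the text: that all 128 bit lines of the 128 × 128 array operate in parallel at the 10 MHz clock of Table 6.3, and that area scales ideally with the square of the feature size when moving from 130nm to 28nm.

```python
OPS_PER_COLUMN = 128            # 64 XNOR + 64 additions per bit line, from the text
COLUMNS = 128                   # assumption: all bit lines of the array work in parallel
FREQ_HZ = 10e6                  # Table 6.3
TILE_AREA_MM2 = 0.46            # from the text
TILE_FRACTION_OF_DIE = 0.55     # from the text
REFERENCE_GOPS_PER_MM2 = 99.1   # accelerator reported in [148]

tile_gops = OPS_PER_COLUMN * COLUMNS * FREQ_HZ / 1e9
tile_area_eff = tile_gops / TILE_AREA_MM2            # GOPS/mm2 for one tile
chip_area_eff = tile_area_eff * TILE_FRACTION_OF_DIE  # GOPS/mm2 over the whole die
advantage = chip_area_eff / REFERENCE_GOPS_PER_MM2

# Assumption: ideal quadratic area scaling from 130nm to 28nm.
ideal_scaling = (130 / 28) ** 2
scaled_advantage = advantage * ideal_scaling

print(f"tile area efficiency : {tile_area_eff:.0f} GOPS/mm2")
print(f"chip area efficiency : {chip_area_eff:.0f} GOPS/mm2")
print(f"advantage over [148] : {advantage:.2f}x")
print(f"with ideal scaling   : {scaled_advantage:.1f}x")
```

The results come out slightly below the rounded values quoted in the text (about 356 versus 360 GOPS/mm2 per tile), which is the level of agreement one would expect from a calculation of this kind.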
Comparison With nv-FPGA
We then compare Liquid Silicon with the nonvolatile FPGA (nv-FPGA) [118][84]. Both
of them provide high flexibility, i.e., they support both machine learning and big data
applications, but Liquid Silicon has higher efficiency due to its novel architecture. To
quantitatively evaluate the efficiency, we map a fully connected binarized neural network and
a content-based similarity search (a key big data application [114]) onto Liquid Silicon and
nv-FPGA. The results show that Liquid Silicon achieves better power and area efficiency than
nv-FPGA (Table 6.5), mainly because 1) nv-FPGA uses small look-up tables (LUTs) and an
extensive routing network to implement discrete logic gates (e.g., XNOR) for computation,
whereas Liquid Silicon consolidates more effective computations within a compact array
structure, resulting in a better algorithmic mapping, and 2) nv-FPGA does not provide
native support for the search operation (it uses a hashing-based implementation), while Liquid
Silicon provides a dedicated operational mode (Search) to achieve higher mapping efficiency.
As a result, Liquid Silicon achieves high power efficiency (0.48 TOPS/W) and area efficiency
(0.74 GOPS/mm2) for the search-intensive application, which are > 100× higher than those
of nv-FPGA.

Table 6.5: Comparison with nv-FPGA

Application        Metric                       Liquid    Suzuki      Liauw
                                                Silicon   VLSI        ISSCC
                                                          2017 [118]  2012 [84]
                   CMOS Process                 130nm     90nm        180nm
                   RRAM Type                    HfO2      p-MTJ*      AlOx
Machine Learning   Power Efficiency (TOPS/W)    60.95     2           0.63
                   Area Efficiency (GOPS/mm2)   188.4     0.2         0.1
Big Data           Power Efficiency (TOPS/W)    0.48      0.004       0.0013
                   Area Efficiency (GOPS/mm2)   0.74      0.0003      0.0002

* Perpendicular magnetic tunnel junction (p-MTJ) devices, another type of nonvolatile memory.

6.6.6 Virtualization Evaluation

This subsection evaluates the performance of the virtualized Liquid Silicon. Specifically,
5 × 5 tiles are grouped together to form one block, which has 25 outputs and 100 inputs
on each side. For the connection blocks, Fcin is set to 0.15, and Fcout is set to 0.10. The
routing channel consists of length-2 wire segments, and the switch block type is chosen
to be Wilton [90]. Since the purpose of this evaluation is to demonstrate that the proposed
virtualization solution can be efficiently extended to Liquid Silicon, the design space (such
as the block size) is not explored; this is left as possible future work. Without completing the
physical design, it is difficult to determine the exact area savings obtained from burying the
routing circuits under the crossbar array using monolithic 3D integration in the virtualized
Liquid Silicon. Thus, the worst-case area is presented in the evaluation, i.e., the
routing circuits are not buried, and the area of these circuits is simply added to the area
of blocks to obtain the overall area results. Nine large benchmarks from the MCNC benchmark
set that require at least two blocks are used in the evaluation. The VTR framework is
applied to obtain the performance of the FPGA architecture.
The performance comparison between the virtualized Liquid Silicon and the non-virtualized
one is presented in Figure 6.35. Overall, the virtualized Liquid Silicon reduces the routing
latency by 46.9% compared to the non-virtualized one, indicating that the
abstraction-architecture co-optimization effectively improves the routing performance. This shorter
routing latency also leads to better energy efficiency (energy-delay product), which is
improved by 66.2% in the virtualized Liquid Silicon. Nevertheless, the area of mapping
applications onto the virtualized Liquid Silicon is 19.2% larger than on the non-virtualized one.
This is mainly caused by the additional FPGA-like routing circuits. The trade-off between
the routing area and routing latency is also an interesting direction to be explored in
future work.

Figure 6.35: The area, delay and EDP (energy-delay product) results of mapping applications
onto the virtualized Liquid Silicon architecture, normalized to those of the non-virtualized one.
[Figure 6.35 plot omitted. Axis: Normalized Result (%). Benchmarks: apex2, clma, des,
diffeq, elliptic, frisc, s38417, s38584, tseng and Geomean. Series: Area, Delay, EDP.
Annotations: area increase 19.2%, delay decrease 46.9%, EDP decrease 66.2%.]

Figure 6.36: The delay result (left) and the number of tracks per unit length (right) of
mapping applications onto the virtualized Liquid Silicon, normalized to those of
the FPGA architecture.
[Figure 6.36 plots omitted. Axes: Normalized Delay (%), Normalized #Track/µm (%).
Annotations: delay reduction 36.0%, track reduction 50.2%.]
The performance of the virtualized Liquid Silicon is then compared to that of the FPGA
architecture. As shown in Figure 6.36, the virtualized Liquid Silicon reduces the total
latency by 36.0% compared to FPGA. Moreover, using tiles for local routing within one
block also reduces the global routing pressure. Thus, the virtualized Liquid Silicon also
reduces the amount of required routing resources (in terms of the number of required routing
channels) by 50.2% on average.

Figure 6.37: The runtime of the compilation framework developed for the virtualized
environment, normalized to that of the framework for the non-virtualized one. Results of
executing all compilation tasks sequentially (top) and in parallel (bottom)
are presented. Only the key compilation tasks are drawn in the figure for simplicity.
[Figure 6.37 plots omitted. Axis: Normalized Runtime. Tasks shown: Low-Level Partition,
Local P&R, Global P&R. Annotations: runtime reduction 59.36% (sequential), 81.46%
(parallel).]
Finally, the compilation time reduction is shown in Figure 6.37. Specifically, the runtime
of the compilation framework for virtualized Liquid Silicon is 2.46× shorter than that of the one
developed for the non-virtualized one. If the compilation tasks are performed in parallel, it
achieves a 5.39× reduction. This reduction is higher than that in the FPGA compilation framework
(Figure 4.13). This is because the simulated annealing algorithm used in Liquid Silicon’s
compilation framework has a higher time complexity than the algorithm used in the FPGA
compilation framework. Thus, partitioning applications and performing the placement at a
smaller granularity are more beneficial in Liquid Silicon.
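The speedup factors quoted above and the percentage reductions annotated in Figure 6.37 are two views of the same quantity. The following minimal sketch shows the conversion; the function name is illustrative.

```python
def runtime_reduction(speedup: float) -> float:
    """Fraction of runtime removed when the runtime becomes `speedup` times shorter."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

sequential_reduction = runtime_reduction(2.46)
parallel_reduction = runtime_reduction(5.39)
print(f"sequential: {sequential_reduction:.1%}")
print(f"parallel:   {parallel_reduction:.1%}")
```

Both values agree with the figure annotations (59.36% and 81.46%) to within the rounding of the quoted speedups.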


Chapter 7
Conclusion
A two-level system abstraction is developed for virtualizing the heterogeneous FPGA cluster,
which decouples the conflicting requirements of runtime management and offline
compilation. Specifically, the high-level abstraction provides a homogeneous view of the
FPGA resources to simplify the runtime management. It also provides an asynchronous
communication interface to enable flexible runtime deployment and support various
inter-FPGA networks. An all-to-all network is also included in the high-level abstraction to
efficiently support large FPGA applications with a high Rent's exponent. In
contrast, the low-level abstraction is designed to be FPGA-specific: it exposes the spatial
resource constraints to the compilation framework to ensure the mapping quality. A
synchronous interface is applied in the low-level abstraction so that the compilation framework
can fully utilize the on-chip routing fabric. Simple direct interconnections are used for
communication, which minimizes the amount of resources reserved by the system and thereby
maximizes the amount of resources available to users.
We further show that this two-level system abstraction can be specialized into a single-level
one to virtualize the homogeneous FPGA cluster. Compared to the two-level system
abstraction, this single-level one reduces the compilation overhead at the cost of a reduced
mapping quality. Thus, it can be utilized for applications that do not have strict
requirements on performance. We also note that, due to the reconfigurability provided by
FPGAs, these two system abstractions can co-exist in one homogeneous FPGA cluster to
balance the compilation cost and the compilation quality. This generic two-level system
abstraction can also be extended to leverage application-specific information to better support
the SaaS model. In this dissertation, we use the application-specific ISA as a case study to
demonstrate this.
A compilation framework is developed for mapping applications onto the two-level system abstraction. The key design principle is maximally reusing the commercial FPGA
compilation tools to minimize the engineering efforts and ensure the compilation quality.
The compilation framework is also extended to support the single-level system abstraction
and the abstraction for application-specific ISA.
Enabled by the two-level system abstraction, a two-level modular runtime system is designed
to provide good extensibility across different heterogeneous FPGA clusters. When
a new type of FPGA is integrated into the cloud, only a new bottom-level controller needs to
be added to the management system, without modifying other components. We also provide
a heuristic-based resource management policy to minimize the resource waste caused
by the fragmentation issue. This heuristic-based policy can be easily extended to take other
runtime factors into consideration, such as the contention on the DRAM bandwidth.
Finally, we use Liquid Silicon, an RRAM-based homogeneous reconfigurable architecture, as a case study to show that the proposed virtualization solution can be extended to
other spatial reconfigurable architectures. Instead of naively applying the proposed virtualization solution onto Liquid Silicon, the two-level system abstraction and the Liquid Silicon
architecture are co-optimized to maximize the efficiency.

7.1 Limitation and Possible Future Works

This dissertation presents our initial efforts on the virtualization of reconfigurable architectures. Nevertheless, we also note that virtualizing reconfigurable architectures is a more
challenging task than virtualizing traditional computing devices (e.g., CPUs). The major
limitations of this dissertation and several possible future directions are listed:
1. Exploring abstraction-architecture co-design for FPGAs
In order to be applied to commercial FPGA devices, the system abstraction proposed in
this dissertation is designed under the given architectural constraints. It might be interesting
to explore the abstraction-architecture co-design for the FPGA architecture to answer the
question: how should the FPGA architecture be designed to provide better virtualization
support? Section 3.5.4 discusses several possible modifications to the FPGA architecture
from the perspective of resource utilization. Nevertheless, improving resource utilization is
only one goal of virtualization, and other design goals, such as providing better isolation for
both security and performance, should also be considered when exploring the
abstraction-architecture co-design.
2. Comprehensive exploration on runtime system
This dissertation focuses on the development of the system abstraction and the compilation
framework, while only a basic runtime system is provided for evaluating the runtime
performance. A more comprehensive exploration of the runtime system could be one
possible future direction. This future direction contains several possible tasks. (1) A better
strategy for sharing peripheral devices, such as on-board DRAM. Specifically, the peripheral
devices are shared in a round-robin manner among physical blocks in the current system
abstraction design, which could be replaced with a more sophisticated strategy to reduce
the performance interference. Moreover, the heuristic-based resource allocation policy could
also be extended to take the peripheral devices into consideration. For instance, the policy
could be modified to avoid deploying two applications that have a high demand on the
on-board DRAM bandwidth onto the same physical FPGA device to reduce the performance
interference. (2) A comprehensive evaluation of the runtime performance. In particular, it
might be interesting to evaluate the benefits of supporting heterogeneous FPGA clusters,
i.e., the benefits of splitting one application and deploying it onto different types of FPGAs.
This is not well evaluated in this dissertation due to the limited size of the custom FPGA
cluster. (3) A runtime system that provides additional useful features, such as supporting
workload migration for fault tolerance.
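The DRAM-aware extension suggested in task (1) could take a form like the sketch below. This is a hypothetical illustration and not the policy implemented in this dissertation: the `Device` and `App` types, the normalized bandwidth demand, and the tie-breaking by tightest fit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Device:
    name: str
    free_blocks: int
    dram_demand: float = 0.0  # summed demand of deployed apps, 1.0 = full DRAM bandwidth

@dataclass
class App:
    name: str
    blocks: int
    dram_demand: float  # fraction of one device's DRAM bandwidth

def place(app: App, devices: list[Device]) -> Optional[Device]:
    """Pick a device with enough free blocks, preferring low DRAM oversubscription.

    Ties are broken by the tightest fit, to limit fragmentation.
    """
    candidates = [d for d in devices if d.free_blocks >= app.blocks]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda d: (max(0.0, d.dram_demand + app.dram_demand - 1.0),
                       d.free_blocks - app.blocks),
    )
    best.free_blocks -= app.blocks
    best.dram_demand += app.dram_demand
    return best

# Hypothetical cluster state.
cluster = [Device("fpga0", free_blocks=4, dram_demand=0.8),
           Device("fpga1", free_blocks=8, dram_demand=0.1)]
first = place(App("bandwidth_heavy", blocks=2, dram_demand=0.7), cluster)
second = place(App("bandwidth_light", blocks=2, dram_demand=0.1), cluster)
print(first.name, second.name)
```

In this example the bandwidth-heavy application is steered away from the device whose DRAM is already heavily used, while the light application falls back to the tightest fit.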
3. Better virtualization support for Liquid Silicon
In this dissertation, the Liquid Silicon architecture is used as a case study to show that
the proposed virtualization solution can be extended to other reconfigurable architectures.
Only necessary modifications are applied to the proposed virtualization solution during the
extension, while custom tools are reused as much as possible to minimize the engineering
efforts. Nevertheless, the Liquid Silicon architecture provides additional flexibility compared
to the FPGA architecture. For instance, the Liquid Silicon architecture has less restrictive
constraints on the shape of physical blocks than the FPGA architecture. It might
be interesting to further optimize the virtualization solution for Liquid Silicon by exploiting
the unique flexibility provided by the Liquid Silicon architecture.


BIBLIOGRAPHY
[1] Amazon. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing/.
[2] Amazon. Amazon EC2 Spot Instances Pricing. https://aws.amazon.com/ec2/spot/pricing/.
[3] Amazon. Introducing Amazon EC2 P3 Instances.
https://aws.amazon.com/about-aws/whats-new/2017/10/introducing-amazon-ec2-p3-instances/.
[4] Amazon. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/, 2016.

[5] Amazon. Accelerated Computing on AWS. http://asapconference.org/slides/amazon.pdf, 2017.
[6] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe,
T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, et al. Brein memory: A 13-layer 4.2 k
neuron/0.8 m synapse binary/ternary reconfigurable in-memory deep neural network
accelerator in 65 nm cmos. In 2017 Symposium on VLSI Circuits, pages C24–C25.
IEEE, 2017.
[7] M. Asiatici, N. George, K. Vipin, S. A. Fahmy, and P. Ienne. Virtualized Execution
Runtime for FPGA Accelerators in the Cloud. IEEE Access, 5:1900–1910, 2017.
[8] ASU. Predictive technology model (ptm). http://ptm.asu.edu/.
[9] V. Baena-Lecuyer, M. Aguirre, A. Torralba, L. G. Franquelo, and J. Faura. Decoder-Driven
Switching Matrices in Multicontext FPGAs: Area Reduction and Their Effect
on Routability. In Circuits and Systems, 1999. ISCAS’99. Proceedings of the 1999
IEEE International Symposium on, volume 1, pages 463–466. IEEE, 1999.
[10] U. Berkeley. Berkeley logic interchange format (BLIF), 1992.
[11] A. Brant and G. G. Lemieux. ZUMA: An Open FPGA Overlay Architecture. In
FCCM, pages 93–96. IEEE, 2012.
[12] J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing.
Bioinformatics, 17(5):419–428, 2001.


[13] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow. FPGAs in the
Cloud: Booting Virtualized Hardware Accelerators with OpenStack. In 2014 IEEE
22nd Annual International Symposium on Field-Programmable Custom Computing
Machines, pages 109–116. IEEE, 2014.
[14] Cadence. Protium S1 FPGA-Based Prototyping Platform.
https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/system-design-verification/protium-s1-fpga-based-prototyping-platform-ds.pdf.
[15] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Stream Computations Organized for Reconfigurable Execution (SCORE). In International Workshop
on Field Programmable Logic and Applications, pages 605–614. Springer, 2000.
[16] E. Caspi, A. DeHon, and J. Wawrzynek. A Streaming Multi-Threaded Model, 2001.
[17] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman,
S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, et al. A Cloud-Scale Acceleration Architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–13. IEEE, 2016.
[18] D. Chang and M. Marek-Sadowska. Partitioning Sequential Circuits on Dynamically
Reconfigurable FPGAs. IEEE Transactions on Computers, 48(6):565–578, 1999.
[19] M.-F. Chang, C.-C. Lin, A. Lee, C.-C. Kuo, G.-H. Yang, H.-J. Tsai, T.-F. Chen, S.-S.
Sheu, P.-L. Tseng, H.-Y. Lee, et al. 17.5 A 3T1R nonvolatile TCAM using MLC
ReRAM with Sub-1ns search time. In Solid-State Circuits Conference-(ISSCC), 2015
IEEE International, pages 1–3. IEEE, 2015.
[20] A. Chen. A Comprehensive Crossbar Array Model With Solutions for Line Resistance and Nonlinear Device Characteristics. IEEE Transactions on Electron Devices,
60(4):1318–1326, 2013.
[21] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, and K. Wang. Enabling
FPGAs in the Cloud. In Proceedings of the 11th ACM Conference on Computing
Frontiers, pages 1–10, 2014.
[22] Y.-C. Chen et al. Non-volatile 3D Stacking RRAM-based FPGA. In FPL, 2012.
[23] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill,
M. Liu, D. Lo, S. Alkalay, M. Haselman, et al. Serving DNNs in Real Time at
Datacenter Scale with Project Brainwave. IEEE Micro, 38(2):8–20, 2018.
[24] E. S. Chung, J. D. Davis, and J. Lee. Linqits: Big data on little clients. In ACM
SIGARCH Computer Architecture News, volume 41, pages 261–272. ACM, 2013.
[25] J. Cong et al. FPGA-RPI: A Novel FPGA Architecture with RRAM-based Programmable Interconnects. IEEE Transactions on VLSI, 22(4):864–877, 2014.

157

[26] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level
Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 30(4):473–491, 2011.
[27] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural
networks with binary weights during propagations. In Advances in Neural Information
Processing Systems, pages 3123–3131, 2015.
[28] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural
networks: Training deep neural networks with weights and activations constrained
to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[29] L. H. Crockett, R. A. Elliot, M. A. Enderwitz, and R. W. Stewart. The Zynq
Book: Embedded Processing with the Arm Cortex-A9 on the Xilinx Zynq-7000 All
Programmable Soc. Strathclyde Academic Media, 2014.
[30] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto,
J. Wong, P. Yiannacouras, and D. P. Singh. From OpenCL to High-Performance
Hardware on FPGAs. In FPL, pages 531–534. IEEE, 2012.
[31] G. Dai, Y. Chi, Y. Wang, and H. Yang. FPGP: Graph Processing Framework on
FPGA A Case Study of Breadth-First Search. In FPGA, pages 105–110. ACM, 2016.
[32] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang. ForeGraph: Exploring
Large-Scale Graph Processing on multi-FPGA Architecture. In FPGA, pages 217–226.
ACM, 2017.
[33] A. DeHon. DPGA Utilization and Application. In Fourth International ACM Symposium on Field-Programmable Gate Arrays, pages 115–121. IEEE, 1996.
[34] A. DeHon. Nanowire-based Programmable Architectures. ACM Journal on Emerging
Technologies in Computing Systems (JETC), 1(2):109–162, 2005.
[35] A. DeHon, Y. Markovsky, E. Caspi, M. Chu, R. Huang, S. Perissakis, L. Pozzi, J. Yeh,
and J. Wawrzynek. Stream Computations Organized For Reconfigurable Execution.
Microprocessors and Microsystems, 30(6):334–354, 2006.
[36] C. Dong et al. 3-D nFPGA: A Reconfigurable Architecture for 3-D
CMOS/nanomaterial Hybrid Digital Circuits. IEEE Transactions on Circuits and
Systems I: Regular Papers, 54(11):2489–2501, 2007.
[37] N. Engelhardt and H. K.-H. So. GraVF: A Vertex-Centric Distributed Graph Processing Framework on FPGAs. In FPL, pages 1–4. IEEE, 2016.
[38] S. A. Fahmy, K. Vipin, and S. Shreejith. Virtualized FPGA Accelerators for Efficient
Cloud Computing. In Cloud Computing Technology and Science, 2015 IEEE 7th
International Conference on, pages 430–435. IEEE, 2015.
[39] D. G. Feitelson and A. M. Weil. Utilization and Predictability in Scheduling the
IBM SP2 with Backfilling. In Proceedings of the First Merged International Parallel
Processing Symposium and Symposium on Parallel and Distributed Processing, pages
542–546. IEEE, 1998.
[40] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay,
M. Haselman, L. Adams, M. Ghandi, et al. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium
on Computer Architecture (ISCA), pages 1–14. IEEE, 2018.
[41] P. Francisco et al. The Netezza data appliance architecture: A platform for high
performance data warehousing and analytics, 2011.
[42] P.-E. Gaillardon et al. GMS: Generic Memristive Structure for Non-volatile FPGAs.
In VLSI-SoC, pages 94–98. IEEE, 2012.
[43] C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck. DeltaRNN: A Power-efficient
Recurrent Neural Network Accelerator. In FPGA, pages 21–30. ACM, 2018.
[44] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. FPDeep:
Acceleration and Load Balancing of CNN Training on FPGA Clusters. In FCCM,
pages 81–84. IEEE, 2018.
[45] E. I. Goldberg, M. R. Prasad, and R. K. Brayton. Using SAT for Combinational
Equivalence Checking. In Proceedings Design, Automation and Test in Europe. Conference and Exhibition 2001, pages 114–121. IEEE, 2001.
[46] S. C. Goldstein and M. Budiu. NanoFabrics: Spatial Computing Using Molecular
Electronics. In ISCA 01. Citeseer, 2001.
[47] Google. Cloud TPU - System Architecture. https://cloud.google.com/tpu/docs/
system-architecture.
[48] Google. Cloud TPU Pricing. https://cloud.google.com/tpu/pricing.
[49] Google. GCP Pricing — Google Cloud. https://cloud.google.com/pricing.
[50] A. Grossi, E. Nowak, C. Zambelli, C. Pellissier, S. Bernasconi, G. Cibrario, K. El Hajjam, R. Crochemore, J. Nodin, P. Olivo, et al. Fundamental Variability Limits
of Filament-based RRAM. In 2016 IEEE International Electron Devices Meeting
(IEDM), pages 4–7. IEEE, 2016.
[51] L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, and J. Cong. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency
HLS Design on Multi-Die FPGAs. In 2021 International Symposium on Field-Programmable Gate Arrays (FPGA), 2021.
[52] T. R. Halfhill. Tabula’s Time Machine Rapidly Reconfigurable Chips Will Challenge
Conventional FPGAs. Microprocessor report, 2010.
[53] K. Huang et al. A Low Active Leakage and High Reliability Phase Change Memory
(PCM) Based Non-volatile FPGA Storage Element. IEEE Transactions on Circuits
and Systems I: Regular Papers, 61(9):2605–2613, 2014.
[54] M. Huang, D. Wu, C. H. Yu, Z. Fang, M. Interlandi, T. Condie, and J. Cong. Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter
Scale. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages
456–469, 2016.
[55] InAccel. Coral FPGA Resource Manager. https://inaccel.com/
coral-fpga-resource-manager/, 2018.
[56] Intel. Intel SoC FPGAs. https://www.intel.com/content/www/us/en/products/
programmable/soc.html.
[57] Intel. Intel Processors and FPGAs - Better Together. https://itpeernetwork.
intel.com/intel-processors-fpga-better-together/#gs.73kkg1, 2018.
[58] Intel. Intel Quartus Prime Standard Edition User Guide: Partial Reconfiguration.
https://www.intel.com/content/www/us/en/programmable/documentation/
wck1529450731513.html, 2018.
[59] A. K. Jain, D. L. Maskell, and S. A. Fahmy. Throughput oriented FPGA overlays
using DSP blocks. In 2016 Design, Automation & Test in Europe Conference &
Exhibition (DATE), pages 1628–1633. IEEE, 2016.
[60] S. Jeloka et al. A 28 nm Configurable Memory (TCAM/BCAM/SRAM) Using Push-Rule 6T Bit Cell Enabling Logic-in-Memory. JSC, 51(4), 2016.
[61] W. Jiang. Scalable ternary content addressable memory implementation using fpgas.
In ANCS, pages 71–82, Oct 2013.
[62] S. H. Jo et al. 3D-Stackable Crossbar Resistive Memory Based on Field Assisted
Superlinear Threshold (FAST) Selector. In IEDM, pages 6.7.1–6.7.4, Dec 2014.
[63] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, et al. BlueDBM:
An Appliance for Big Data Analytics. In ISCA, pages 1–13. IEEE, 2015.
[64] C. Kao. Benefits of Partial Reconfiguration. Xcell journal, 55:65–67, 2005.
[65] A. Kawahara, R. Azuma, Y. Ikeda, K. Kawai, Y. Katoh, Y. Hayakawa, K. Tsuji,
S. Yoneda, A. Himeno, K. Shimakawa, et al. An 8 mb multi-layered cross-point
reram macro with 443 mb/s write throughput. IEEE Journal of Solid-State Circuits,
48(1):178–185, 2013.
[66] D. Kawakami, Y. Shibata, and H. Amano. A Prototype Chip of Multicontext FPGA
with DRAM for Virtual Hardware. In Proceedings of the 2001 Asia and South Pacific
Design Automation Conference, pages 17–18. ACM, 2001.
[67] A. Khawaja, J. Landgraf, R. Prakash, M. Wei, E. Schkufza, and C. J. Rossbach.
Sharing, Protection, and Compatibility for Reconfigurable Fabric with AmorphOS.
In 13th USENIX Symposium on Operating Systems Design and Implementation
(OSDI 18), pages 107–127, 2018.

[68] W.-S. Khwa, J.-J. Chen, J.-F. Li, X. Si, E.-Y. Yang, X. Sun, R. Liu, P.-Y. Chen, Q. Li,
S. Yu, et al. A 65nm 4kb algorithm-dependent computing-in-memory sram unit-macro
with 2.3 ns and 55.8 tops/w fully parallel product-sum operation for binary dnn edge
processors. In 2018 IEEE International Solid-State Circuits Conference-(ISSCC),
pages 496–498. IEEE, 2018.
[69] M. Kim and P. Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071,
2016.
[70] R. Kirchgessner, G. Stitt, A. George, and H. Lam. VirtualRC: A Virtual FPGA
Platform for Applications and Tools Portability. In Proceedings of the ACM/SIGDA
international symposium on Field Programmable Gate Arrays, pages 205–208, 2012.
[71] O. Knodel, P. R. Genssler, and R. G. Spallek. Virtualizing Reconfigurable Hardware
to Provide Scalability in Cloud Architectures. Reconfigurable Architectures, Tools and
Applications, RECATA, 2017.
[72] O. Knodel and R. G. Spallek. Computing Framework for Dynamic Integration of Reconfigurable Resources in a Cloud. In 2015 Euromicro Conference on Digital System
Design, pages 337–344. IEEE, 2015.
[73] O. Knodel and R. G. Spallek. RC3E: Provision and Management of Reconfigurable
Hardware Accelerators in a Cloud Environment. arXiv preprint arXiv:1508.06843,
2015.
[74] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image
search. In Computer Vision, 2009 IEEE 12th International Conference on, pages
2130–2137. IEEE, 2009.
[75] P. Kumbhare and V. Krishna. Designing High-Performance Video Systems in 7 Series
FPGAs with the AXI Interconnect. Xilinx, Inc., San Jose, CA, USA, Appl. Note,
7:1–24, 2012.
[76] B. S. Landman and R. L. Russo. On a pin versus block relationship for partitions of
logic graphs. IEEE Transactions on computers, 100(12):1469–1479, 1971.
[77] C. Lavin and A. Kaviani. RapidWright: Enabling Custom Crafted Implementations
for FPGAs. In 26th Annual International Symposium on Field-Programmable Custom
Computing Machines, pages 133–140. IEEE, 2018.
[78] C. Lavin and A. Kaviani. Build Your Own Domain-Specific Solutions with RapidWright: Invited Tutorial. In Proceedings of the 2019 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, FPGA ’19, pages 14–22, New York,
NY, USA, 2019. Association for Computing Machinery.
[79] M.-J. Lee et al. A Fast, High-Endurance and Scalable Non-Volatile Memory Device Made From Asymmetric Ta2O5−x/TaO2−x Bilayer Structures. Nature materials,
10(8):625–630, 2011.

[80] T.-Y. Lee, C.-C. Hu, L.-W. Lai, and C.-C. Tsai. Hardware Context-Switch Methodology for Dynamically Partially Reconfigurable Systems. Journal of Information Science
and Engineering, 26(4):1289–1305, 2010.
[81] J. Li, R. K. Montoye, M. Ishii, and L. Chang. 1 Mb 0.41 µm2 2T-2R Cell Nonvolatile
TCAM with Two-bit Encoding and Clocked Self-Referenced Sensing. IEEE Journal
of Solid-State Circuits, 49(4):896–907, 2014.
[82] S. Li, H. Lim, V. W. Lee, J. H. Ahn, A. Kalia, M. Kaminsky, D. G. Andersen,
O. Seongil, S. Lee, and P. Dubey. Architecting to Achieve a Billion Requests Per
Second Throughput on a Single Key-Value Store Server Platform. In ACM SIGARCH
Computer Architecture News, volume 43, pages 476–488. ACM, 2015.
[83] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren. A 7.663-TOPS 8.2-W Energy-efficient
FPGA Accelerator for Binary Convolutional Neural Networks. In FPGA, pages 290–
291. ACM, 2017.
[84] Y. Y. Liauw et al. Nonvolatile 3D-FPGA With Monolithically Stacked RRAM-based
Configuration Memory. In ISSCC, pages 406–408. IEEE, 2012.
[85] C.-C. Lin et al. 7.4 A 256b-wordlength ReRAM-based TCAM with 1ns search-time
and 14× improvement in wordlength-energyefficiency-density product using 2.5T1R
cell. In ISSCC, pages 136–137. IEEE, 2016.
[86] Y. Liu, Z. Wang, A. Lee, F. Su, C.-P. Lo, Z. Yuan, C.-C. Lin, Q. Wei, Y. Wang,
Y.-C. King, et al. 4.7 a 65nm reram-enabled nonvolatile processor with 6× reduction
in restore time and 4× higher clock frequency using adaptive data retention and self-write-termination nonvolatile logic. In 2016 IEEE International Solid-State Circuits
Conference (ISSCC), pages 84–86. IEEE, 2016.
[87] J. Luu et al. VTR 7.0: Next generation architecture and CAD system for FPGAs.
TRETS, 7(2):6, 2014.
[88] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe lsh: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international
conference on Very large data bases, pages 950–961. VLDB Endowment, 2007.
[89] R. Lysecky, K. Miller, F. Vahid, and K. Vissers. Firm-core Virtual FPGA for Just-in-Time FPGA Compilation. In Proceedings of the 2005 ACM/SIGDA 13th international
symposium on Field-programmable gate arrays, pages 271–271, 2005.
[90] M. I. Masud. FPGA Routing Structures: A Novel Switch Block and Depopulated
Interconnect Matrix Architectures. Master’s thesis, University of British Columbia,
1999.
[91] L. McMurchie and C. Ebeling. PathFinder: A Negotiation-based Performance-Driven
Router for FPGAs. In FPGA, pages 111–117. ACM, 1995.
[92] R. Menotti, J. M. Cardoso, M. M. Fernandes, and E. Marques. Automatic Generation
of FPGA Hardware Accelerators Using A Domain Specific Language. In FPL, pages
457–461. IEEE, 2009.
[93] A. Mishchenko, S. Chatterjee, R. Brayton, and N. Een. Improvements to Combinational Equivalence Checking. In 2006 IEEE/ACM International Conference on
Computer Aided Design, pages 836–843. IEEE, 2006.
[94] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton. Combinational and Sequential
Mapping with Priority Cuts. In Proceedings of the 2007 IEEE/ACM international
conference on Computer-aided design, pages 354–361. IEEE Press, 2007.
[95] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst. Binareye: An
always-on energy-accuracy-scalable binary cnn processor with all memory on chip in
28nm cmos. In 2018 IEEE Custom Integrated Circuits Conference (CICC), pages 1–4.
IEEE, 2018.
[96] K. E. Murray, O. Petelin, S. Zhong, J. M. Wang, M. Eldafrawy, J.-P. Legault, E. Sha,
A. G. Graham, J. Wu, M. J. Walker, et al. VTR 8: High-performance cad and customizable fpga architecture modelling. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 13(2):1–55, 2020.
[97] S. Narang and G. Diamos. Baidu DeepBench. https://github.com/
baidu-research/DeepBench, 2017.
[98] R. Njuguna. A survey of FPGA benchmarks. Project Report, November, 24, 2008.
[99] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr. Accelerating binarized neural networks: Comparison of fpga, cpu, gpu, and asic. Proc. ICFPT,
2016.
[100] T. Oguntebi and K. Olukotun. GraphOps: A Dataflow Library for Graph Analytics
Acceleration. In FPGA, pages 111–117. ACM, 2016.
[101] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and O. Ozturk.
Energy Efficient Architecture for Graph Analytics Accelerators. In ISCA, pages 166–
177. IEEE, 2016.
[102] M. A. Özkan, O. Reiche, F. Hannig, and J. Teich. FPGA-based Accelerator Design
From A Domain-Specific Language. In FPL, pages 1–9. IEEE, 2016.
[103] Panasonic. Panasonic Starts World’s First Mass Production of ReRAM Mounted
Microcomputers, 2013.
[104] J. Park, H. Sharma, D. Mahajan, J. K. Kim, P. Olds, and H. Esmaeilzadeh. Scale-out
Acceleration for Machine Learning. In MICRO. ACM, 2017.
[105] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram,
C. Kozyrakis, and K. Olukotun. Plasticine: A Reconfigurable Architecture for Parallel
Patterns. In 2017 ACM/IEEE 44th Annual International Symposium on Computer
Architecture (ISCA), pages 389–402. IEEE, 2017.
[106] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme,
H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. A Reconfigurable Fabric for
Accelerating Large-Scale Datacenter Services. In 2014 ACM/IEEE 41st International
Symposium on Computer Architecture (ISCA), pages 13–24. IEEE, 2014.
[107] C. Ramesh, S. B. Patil, S. N. Dhanuskodi, G. Provelengios, S. Pillement, D. Holcomb,
and R. Tessier. FPGA Side Channel Attacks Without Physical Access. In 2018 IEEE
26th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM), pages 45–52. IEEE, 2018.
[108] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating
MapReduce for Multi-core and Multiprocessor Systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages
13–24. IEEE, 2007.
[109] W. Rao et al. Logic Mapping in Crossbar-based Nanoarchitectures. IEEE Design &
Test of Computers, 26(1):68–77, 2009.
[110] D. Ravichandran, P. Pantel, and E. Hovy. Randomized algorithms and nlp: using
locality sensitive hash function for high speed noun clustering. In Proceedings of
the 43rd annual meeting on association for computational linguistics, pages 622–629.
Association for Computational Linguistics, 2005.
[111] K. Saban. Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough
FPGA Capacity, Bandwidth, and Power Efficiency. Xilinx, White Paper, 1(1):1–10,
2011.
[112] D. Sheffield. IvyTown Xeon + FPGA: The HARP Program. https://cpufpga.
files.wordpress.com/2016/04/harp_isca_2016_final.pdf, 2016.
[113] R. S. Shenoy et al. MIEC (Mixed-Ionic-Electronic-Conduction)-Based Access Devices
for Non-volatile Crossbar Memory Arrays. Semiconductor Science and Technology,
29(10):104005, 2014.
[114] R. Shinde, A. Goel, P. Gupta, and D. Dutta. Similarity Search and Locality Sensitive
Hashing Using Ternary Content Addressable Memories. In Proceedings of the 2010
ACM SIGMOD International Conference on Management of data, pages 375–386.
ACM, 2010.
[115] M. Stoer and F. Wagner. A Simple Min-Cut Algorithm. Journal of the ACM (JACM),
44(4):585–591, 1997.
[116] F. Su, W.-H. Chen, L. Xia, C.-P. Lo, T. Tang, Z. Wang, K.-H. Hsu, M. Cheng, J.-Y.
Li, Y. Xie, et al. A 462gops/j rram-based nonvolatile intelligent processor for energy
harvesting ioe system featuring nonvolatile logics and processing-in-memory. In 2017
Symposium on VLSI Technology, pages T260–T261. IEEE, 2017.
[117] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and
Y. Cao. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale
Convolutional Neural Networks. In FPGA, pages 16–25. ACM, 2016.

[118] D. Suzuki, M. Natsui, A. Mochizuki, S. Miura, H. Honjo, H. Sato, S. Fukami, S. Ikeda,
T. Endoh, H. Ohno, et al. Fabrication of a 3000-6-input-luts embedded and block-level power-gated nonvolatile fpga chip using p-mtj-based logic-in-memory structure.
In 2015 Symposium on VLSI Circuits (VLSI Circuits), pages C172–C173. IEEE, 2015.
[119] N. Tarafdar, T. Lin, E. Fukuda, H. Bannazadeh, A. Leon-Garcia, and P. Chow.
Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 237–246, 2017.
[120] K. Tatsumura, M. Oda, and S. Yasuda. A Pure-CMOS Nonvolatile Multi-Context Configuration Memory for Dynamically Reconfigurable FPGAs. In Field-Programmable Technology (FPT), 2014 International Conference on, pages 215–222.
IEEE, 2014.
[121] D. E. Taylor and J. S. Turner. Classbench: A Packet Classification Benchmark.
IEEE/ACM Transactions on Networking (TON), 15(3):499–511, 2007.
[122] A. C. Torrezan et al. Sub-nanosecond Switching of a Tantalum Oxide Memristor.
Nanotechnology, 22(48):485203, 2011.
[123] S. Trimberger, D. Carberry, A. Johnson, and J. Wong. A Time-Multiplexed FPGA.
In Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (Cat. No. 97TB100186), pages 22–28. IEEE, 1997.
[124] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and
K. Vissers. Finn: A framework for fast, scalable binarized neural network inference.
arXiv preprint arXiv:1612.07119, 2016.
[125] A. Vaishnav, K. D. Pham, D. Koch, and J. Garside. Resource Elastic Virtualization
for FPGAs Using OpenCL. In 28th International Conference on Field Programmable
Logic and Applications, pages 111–1117. IEEE, 2018.
[126] C. Van Eijk. Sequential Equivalence Checking based on Structural Similarities.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
19(7):814–819, 2000.
[127] P. J. Van Laarhoven and E. H. Aarts. Simulated Annealing. In Simulated annealing:
Theory and applications, pages 7–15. Springer, 1987.
[128] C.-H. Wang et al. Three-Dimensional 4F² ReRAM Cell With CMOS Logic Compatible Process. In IEDM, pages 29–6. IEEE, 2010.
[129] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang. C-LSTM:
Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In
FPGA, pages 11–20. ACM, 2018.
[130] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf. Enabling FPGAs in
Hyperscale Data Centers. In 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence
and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing
and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its
Associated Workshops (UIC-ATC-ScalCom), pages 1078–1086. IEEE, 2015.
[131] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference
on FPGAs. In DAC, page 29. ACM, 2017.
[132] Z. Wei et al. Switching and Reliability Mechanisms for ReRAM. In IEEE International
Interconnect Technology Conference, pages 349–352, May 2014.
[133] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii,
K. Katayama, M. Iijima, T. Mikawa, T. Ninomiya, R. Miyanaga, Y. Kawashima,
K. Tsuji, A. Himeno, T. Okada, R. Azuma, K. Shimakawa, H. Sugaya, T. Takagi,
R. Yasuhara, K. Horiba, H. Kumigashira, and M. Oshima. Highly reliable taox reram
and direct evidence of redox reaction mechanism. In 2008 IEEE International Electron
Devices Meeting, pages 1–4, Dec 2008.
[134] H.-S. P. Wong et al. Metal-oxide RRAM. Proceedings of the IEEE, 100(6), 2012.
[135] Y.-L. Wu and D. Chang. On the NP-completeness of regular 2-D FPGA routing
architectures and a novel solution. In ICCAD, pages 362–366. IEEE Computer Society
Press, 1994.
[136] Y. Xiao, S. T. Ahmed, and A. DeHon. Fast linking of separately-compiled fpga blocks
without a noc. In 2020 International Conference on Field-Programmable Technology
(ICFPT), 2020.
[137] Y. Xiao, A. DeHon, et al. PLD: Fast FPGA Compilation to Make Reconfigurable Acceleration Compatible with Modern Incremental Refinement Software Development.
In Proceedings of the 27th ACM International Conference on Architectural Support
for Programming Languages and Operating Systems, 2022.
[138] Y. Xiao, D. Park, A. Butt, H. Giesen, Z. Han, R. Ding, N. Magnezi, R. Rubin, and
A. DeHon. Reducing FPGA Compile Time with Separate Compilation for FPGA
Building Blocks. In 2019 International Conference on Field-Programmable Technology
(ICFPT), pages 153–161. IEEE, 2019.
[139] Xilinx. UltraScale Architecture and Product Data Sheet: Overview.
https://www.xilinx.com/support/documentation/data_sheets/
ds890-ultrascale-overview.pdf.
[140] Xilinx. Large FPGA Methodology Guide. https://www.xilinx.com/support/
documentation/sw_manuals/xilinx14_7/ug872_largefpga.pdf, 2012.
[141] Xilinx. UltraScale Architecture Configurable Logic Block. https://www.xilinx.
com/support/documentation/user_guides/ug574-ultrascale-clb.pdf, 2017.
[142] Xilinx. Vivado Design Suite User Guide Hierarchical Design.
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/
ug905-vivado-hierarchical-design.pdf, 2017.
[143] Xilinx. Vivado Design Suite User Guide Partial Reconfiguration.
https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/
ug909-vivado-partial-reconfiguration.pdf, 2018.
[144] K. Yamazaki, Y. Nakajima, T. Hatano, and A. Miyazaki. Lagopus FPGA–A Reprogrammable Data Plane for High-Performance Software SDN Switches. In 2015 IEEE
Hot Chips 27 Symposium (HCS), pages 1–1. IEEE, 2015.
[145] S. Yang. Logic synthesis and optimization benchmarks user guide: version 3.0. MCNC,
1991.
[146] A. Yazdinejad, A. Bohlooli, and K. Jamshidi. Efficient Design and Hardware Implementation of the OpenFlow v1.3 Switch on the Virtex-6 FPGA ML605. The Journal
of Supercomputing, 74(3):1299–1320, 2018.
[147] S. Yin, P. Ouyang, J. Yang, T. Lu, X. Li, L. Liu, and S. Wei. An ultra-high energy-efficient reconfigurable processor for deep neural networks with binary/ternary weights
in 28nm cmos. In 2018 IEEE Symposium on VLSI Circuits, pages 37–38. IEEE, 2018.
[148] S. Yin, P. Ouyang, S. Zheng, D. Song, X. Li, L. Liu, and S. Wei. A 141 uw, 2.46
pj/neuron binarized convolutional neural network based self-learning speech recognition processor in 28nm cmos. In 2018 IEEE Symposium on VLSI Circuits, pages
139–140. IEEE, 2018.
[149] S. Zeng, G. Dai, H. Sun, K. Zhong, G. Ge, K. Guo, Y. Wang, and H. Yang. Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud.
In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 2020.
[150] Y. Zha and J. Li. Reconfigurable in-memory computing with resistive memory crossbar. In Proceedings of the 35th International Conference on Computer-Aided Design,
pages 1–8, 2016.
[151] Y. Zha and J. Li. RRAM-based reconfigurable in-memory computing architecture
with hybrid routing. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 527–532. IEEE, 2017.
[152] Y. Zha and J. Li. Liquid silicon: A data-centric reconfigurable architecture enabled by
rram technology. In Proceedings of the 2018 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pages 51–60, 2018.
[153] Y. Zha and J. Li. Liquid silicon-monona: A reconfigurable memory-oriented computing fabric with scalable multi-context support. ACM SIGPLAN Notices, 53(2):214–
228, 2018.
[154] Y. Zha and J. Li. Virtualizing FPGAs in the Cloud. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages
and Operating Systems, pages 845–858, 2020.

[155] Y. Zha and J. Li. Hetero-ViTAL: a virtualization stack for heterogeneous FPGA
clusters. In 2021 ACM/IEEE 48th Annual International Symposium on Computer
Architecture (ISCA), pages 470–483. IEEE, 2021.
[156] Y. Zha and J. Li. When application-specific ISA meets FPGAs: a multi-layer virtualization framework for heterogeneous cloud FPGAs. In Proceedings of the 26th ACM
International Conference on Architectural Support for Programming Languages and
Operating Systems, pages 123–134, 2021.
[157] Y. Zha, E. Nowak, and J. Li. Liquid Silicon: A Nonvolatile Fully Programmable
Processing-In-Memory Processor with Monolithically Integrated ReRAM for Big
Data/Machine Learning Applications. In 2019 Symposium on VLSI Circuits, pages
C206–C207. IEEE, 2019.
[158] Y. Zha, E. Nowak, and J. Li. Liquid silicon: A nonvolatile fully programmable
processing-in-memory processor with monolithically integrated ReRAM. IEEE Journal of Solid-State Circuits, 55(4):908–919, 2020.
[159] J. Zhang and J. Li. Improving the Performance of OpenCL-based FPGA Accelerator
for Convolutional Neural Network. In FPGA, pages 25–34. ACM, 2017.
[160] J. Zhang, Y. Xiong, N. Xu, R. Shu, B. Li, P. Cheng, G. Chen, and T. Moscibroda.
The Feniks FPGA Operating System for Cloud Computing. In Proceedings of the 8th
Asia-Pacific Workshop on Systems, pages 1–7, 2017.
[161] M. Zhao and G. E. Suh. FPGA-based Remote Power Side-Channel Attacks. In 2018
IEEE Symposium on Security and Privacy, pages 229–244. IEEE, 2018.
[162] Q. Zhao, T. Nakamichi, M. Amagasaki, M. Iida, M. Kuga, and T. Sueyoshi. hCODE:
An Open-Source Platform for FPGA Accelerators. In International Conference on
Field-Programmable Technology, pages 205–208. IEEE, 2016.
[163] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and
Z. Zhang. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In FPGA, pages 15–24. ACM, 2017.
[164] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H.
Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software-Programmable FPGAs. Int’l Symp. on
Field-Programmable Gate Arrays (FPGA), Feb 2018.
[165] D. Ziakas, A. Baum, R. A. Maddox, and R. J. Safranek. Intel® QuickPath Interconnect Architectural Features Supporting Scalable System Architectures. In 18th
Symposium on High Performance Interconnects, pages 1–6. IEEE, 2010.
