27 research outputs found
Mapping applications onto FPGA-centric clusters
High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies.
In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance.
In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention from both the theory community, where topology mapping is referred to as graph embedding, and the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on the resulting load imbalance.
The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping.
This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms, including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces a reconfigurable router design that allocates resources based on the different imbalance situations caused by different mapping behaviors.
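To make the mapping problem concrete, the following is a minimal sketch, not the thesis's method. It assumes a logical ring of processes placed on a 2D mesh and routed with dimension-order (X then Y) routing; the function names `xy_route`, `row_major`, `snake`, and `evaluate` are hypothetical. It reports the dilation (longest path for any logical link), the total hop count, and the per-link load, which is where imbalance between links would show up.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the directed mesh links on the X-then-Y path from src to dst."""
    x, y = src
    links = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def row_major(i, width):
    """Place process i left to right, one row after another."""
    return (i % width, i // width)


def snake(i, width):
    """Place process i in boustrophedon order (odd rows run right to left)."""
    row, col = divmod(i, width)
    return (col if row % 2 == 0 else width - 1 - col, row)


def evaluate(mapping, n, width):
    """Route every logical ring link i -> i+1 and collect statistics."""
    load = Counter()
    dilation = 0
    for i in range(n):
        j = (i + 1) % n  # logical ring neighbor
        path = xy_route(mapping(i, width), mapping(j, width))
        dilation = max(dilation, len(path))
        load.update(path)
    return dilation, sum(load.values()), load


if __name__ == "__main__":
    for name, mapping in (("row-major", row_major), ("snake", snake)):
        dilation, hops, load = evaluate(mapping, n=16, width=4)
        print(name, "dilation:", dilation, "total hops:", hops,
              "links used:", len(load))
```

In this toy setting the two placements use the same mesh yet produce different path lengths and leave different links idle, which is the kind of mapping-dependent imbalance the abstract refers to.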
Enabling Shared Memory Communication in Networks of MPSoCs
Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi-Processor System-on-Chip), combining multiple hard-core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, which is increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting-edge MPSoC device, enabling networks of these devices to communicate within both a distributed and shared memory context, with reduced need for costly software networking system calls. We present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq Ultrascale+, a state-of-the-art MPSoC, and the challenges to be overcome given the device's limitations and constraints. We demonstrate the working prototype system connecting two MPSoCs, with communication between processor and remote memory region and accelerator. We then discuss the limitations of the current implementation and highlight areas of improvement to make this solution production-ready.
Interconnect architectures for dynamically partially reconfigurable systems
Dynamically partially reconfigurable FPGAs (Field-Programmable Gate Arrays) allow
hardware modules to be placed and removed at runtime while other parts of the system
keep working. With their potential benefits, they have been the topic of a great
deal of research over the last decade. To exploit the partial reconfiguration capability of
FPGAs, there is a need for efficient, dynamically adaptive communication infrastructure
that automatically adapts as modules are added to and removed from the system.
Many bus and network-on-chip (NoC) architectures have been proposed to exploit this
capability on FPGA technology. However, few realizations have been reported in the
public literature to demonstrate or compare their performance in real world applications.
While partial reconfiguration can offer many benefits, it is still rarely exploited in practical
applications. Few full realizations of partially reconfigurable systems in current
FPGA technologies have been published. More application experiments are required to
understand the benefits and limitations of implementing partially reconfigurable systems
and to guide their further development. The motivation of this thesis is to fill this
research gap by providing empirical evidence of the cost and benefits of different interconnect
architectures. The results will provide a baseline for future research and will
be directly useful for circuit designers who must make a well-reasoned choice between
the alternatives.
This thesis contains the results of experiments to compare different NoC and bus interconnect
architectures for FPGA-based designs in general and dynamically partially
reconfigurable systems. These two interconnect schemes are implemented and evaluated
in terms of performance, area and power consumption using FFT (Fast Fourier
Transform) and ANN (Artificial Neural Network) systems as benchmarks. Conclusions
drawn from these results include recommendations concerning the interconnect approach
for different kinds of applications. It is found that a NoC provides much better
performance than a single channel bus and similar performance to a multi-channel bus
in both parallel and parallel-pipelined FFT systems. This suggests that a NoC is a better choice for systems with multiple simultaneous communications like the FFT. Bus-based
interconnect achieves better performance and consumes less area and power than the NoC-based
scheme for the fully-connected feed-forward NN system. This suggests buses
are a better choice for systems that do not require many simultaneous communications
or systems with broadcast communications like a fully-connected feed-forward NN.
Results from the experiments with dynamic partial reconfiguration demonstrate that
buses have the advantages of better resource utilization and smaller reconfiguration
time and memory than NoCs. However, NoCs are more flexible and expansible. They
have the advantage of placing almost all of the communication infrastructure in the
dynamic reconfiguration region. This means that different applications running on the
FPGA can use different interconnection strategies without the overhead of fixed bus
resources in the static region.
Another objective of the research is to examine the partial reconfiguration process and
reconfiguration overhead with current FPGA technologies. Partial reconfiguration allows
users to efficiently change the number of running PEs to choose an optimal power-performance
operating point at the minimum cost of reconfiguration. However, this
brings drawbacks including resource utilization inefficiency, power consumption overhead
and decrease in system operating frequency. The experimental results report a
50% resource utilization inefficiency, with a power consumption overhead of less
than 5% and a decrease in frequency of up to 32% compared to a static implementation.
The results also show that most of the drawbacks of partial reconfiguration implementation
come from the restrictions and limitations of partial reconfiguration design flow.
If these limitations can be addressed, partial reconfiguration should still be considered
for its potential benefits.
Thesis (Ph.D.) -- University of Adelaide, School of Electrical and Electronic Engineering, 201
High performance communication on reconfigurable clusters
High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface.
Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications.
We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm.
We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that our design is faster than CPU and GPU implementations by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to that of CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range with a commodity cloud.
Doctor of Philosophy
dissertation
Communication surpasses computation as the power and performance bottleneck in forthcoming exascale processors. Scaling has made transistors cheap, but on-chip wires have grown more expensive, in terms of both latency and energy. Therefore, the need for low-energy, high-performance interconnects is highly pronounced, especially for long-distance communication. In this work, we examine two aspects of the global signaling problem. The first part of the thesis focuses on a high-bandwidth asynchronous signaling protocol for long-distance communication. Asynchrony among intellectual property (IP) cores on a chip has become necessary in a System on Chip (SoC) environment. Traditional asynchronous handshaking protocols suffer from loss of throughput due to the added latency of sending the acknowledge signal back to the sender. We demonstrate a method that supports end-to-end communication across links with arbitrarily large latency, without limiting the bandwidth, so long as line variation can be reliably controlled. We also evaluate the energy and latency improvements that result from the design choices made available by this protocol. The use of transmission lines as a physical interconnect medium shows promise for deep submicron technologies. In our evaluations, we observe a lower energy footprint, as well as vastly reduced wire latency, for transmission line interconnects. We approach this problem from two sides. Using field solvers, we investigate the physical design choices to determine the optimal way to implement these lines for a given back-end-of-line (BEOL) stack. We also approach the problem from a system designer's viewpoint, looking at ways to optimize the lines for different performance targets. This work analyzes the advantages and pitfalls of implementing asynchronous channel protocols for communication over long distances.
Finally, the innovations resulting from this work are applied to a network-on-chip design example and the resulting power-performance benefits are reported.
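The throughput loss of a conventional handshake can be illustrated with a small back-of-the-envelope model. This is a sketch under simplifying assumptions, not the protocol described in the dissertation: each token is assumed to need one full round trip (request forward, acknowledge back) before its slot is free again, and the parameter names `wire_latency_ns`, `tokens_in_flight`, and `min_cycle_ns` are hypothetical.

```python
def handshake_throughput(wire_latency_ns, tokens_in_flight=1, min_cycle_ns=1.0):
    """Estimate link throughput in tokens per nanosecond.

    Assumed model: a token occupies its slot for one round trip of
    2 * wire_latency_ns, so at most tokens_in_flight tokens complete per
    round trip. The sender can never issue faster than one token per
    min_cycle_ns.
    """
    round_trip_ns = 2.0 * wire_latency_ns
    window_limited = tokens_in_flight / round_trip_ns
    source_limited = 1.0 / min_cycle_ns
    return min(window_limited, source_limited)


if __name__ == "__main__":
    # One outstanding token on a 5 ns wire: the acknowledge delay dominates.
    print(handshake_throughput(5.0))
    # Ten outstanding tokens hide the round trip in this model.
    print(handshake_throughput(5.0, tokens_in_flight=10))
```

Under this model, a single outstanding token makes throughput fall in proportion to wire latency, while allowing more tokens in flight recovers the sender-limited rate.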
Master of Science
thesis
Integrated circuits often consist of multiple processing elements that are regularly tiled across the two-dimensional surface of a die. This work presents the design and integration of high-speed relative-timed routers for asynchronous networks-on-chip. It explores NoC efficiency through simplicity by directly extending a simple T-router design (source routing, single-flit packets) to higher-radix routers. The work studies the performance and power trade-offs of adding higher-radix routers, 3D topologies, virtual channels, accurate NoC modeling, and transmission line communication links. Routers with and without virtual channels are designed and integrated into arrayed communication networks. Furthermore, the work investigates 3D networks with diffusive RC wires and transmission lines on long wrap interconnects.
On-Chip Network Design: Mapping, Management, and Routing
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, 2016. 2. Kiyoung Choi.
For decades, advances in semiconductor technology have led us to the era of many-core systems. Today's desktop computers already have multi-core processors, and chips with more than a hundred cores are commercially available. As a communication medium for such a large number of cores, the network-on-chip (NoC) has emerged and is now used by many researchers and companies. Adopting a NoC for a many-core system raises many problems, and this thesis tries to solve some of them.
The second chapter of this thesis is on mapping and scheduling of tasks on NoC-based CMP architectures. Although many papers on NoC mapping have been published, our work shows that choosing the communication type, shared memory or message passing, can help improve performance and energy efficiency. Additionally, our framework supports scheduling applications that contain backward dependencies with the help of modified modulo scheduling.
Evolving SoCs through 3D stacking brings a number of new problems, and the thermal problem arising from increased power density is one of them.
In the third chapter of this thesis, we try to mitigate the hotspot problem using DVFS techniques. Assuming that all routers as well as cores can control their voltage and frequency individually, we find voltage-frequency pairs for all cores and routers that yield the best performance within the given thermal constraint.
The fourth and fifth chapters of this thesis address a different aspect. In 3D stacking, inter-layer interconnections are implemented using through-silicon vias (TSVs). TSVs usually take much more area than normal wires. Furthermore, they consume silicon area as well as metal area. For this reason, designers want to limit the number of TSVs used in their network. To limit the TSV count, there are two options: the first is to reduce the width of each vertical link, and the other is to use fewer vertical links, which results in a partially connected network. We present a routing methodology for each case.
For the network with reduced-bandwidth vertical links, we propose using deflection routing to mitigate the long latency of vertical links. By balancing the vertical traffic properly, the algorithm provides improved latency.
In addition, removing the router buffers yields a large reduction in area and energy.
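As a rough illustration of why deflection routing lets a router drop its buffers, here is a minimal sketch of generic age-based deflection in a 3D mesh. It is not the thesis's algorithm (which additionally balances vertical traffic over serialized TSV links); the port names and the functions `productive_ports` and `assign_ports` are hypothetical. Every arriving flit is given some output port in the same cycle, so nothing needs to be stored.

```python
ALL_PORTS = ("E", "W", "N", "S", "UP", "DOWN")


def productive_ports(pos, dst):
    """Ports that bring a flit at pos closer to dst (coordinates are x, y, z)."""
    ports = []
    for axis, (plus, minus) in enumerate((("E", "W"), ("N", "S"), ("UP", "DOWN"))):
        if dst[axis] > pos[axis]:
            ports.append(plus)
        elif dst[axis] < pos[axis]:
            ports.append(minus)
    return ports


def assign_ports(pos, flits):
    """Assign one output port per flit, oldest flit first.

    flits is a list of (age, dst) tuples. Flits already at their destination
    are assumed to have been ejected before this step. A flit whose
    productive ports are all taken is deflected to the first free port.
    """
    if len(flits) > len(ALL_PORTS):
        raise ValueError("more flits than output ports")
    free = list(ALL_PORTS)
    assignment = {}
    oldest_first = sorted(range(len(flits)), key=lambda i: -flits[i][0])
    for i in oldest_first:
        _, dst = flits[i]
        wanted = [p for p in productive_ports(pos, dst) if p in free]
        port = wanted[0] if wanted else free[0]  # deflect if nothing useful is free
        free.remove(port)
        assignment[i] = port
    return assignment


if __name__ == "__main__":
    flits = [(5, (3, 1, 0)), (9, (2, 1, 0)), (1, (1, 1, 1))]
    print(assign_ports((1, 1, 0), flits))
```

In the usage example, the two flits heading east compete for the same port; the older one wins and the younger one is deflected, while the flit heading to the upper layer takes the vertical port.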
For partially connected networks, we introduce a set of routing rules for selecting the vertical links. At the expense of some routing freedom, the proposed algorithm removes the virtual channel requirement for avoiding deadlock. As a result, either performance can be improved or energy consumption reduced, at the designer's choice.
Chapter 1 Introduction
1.1 Task Mapping and Scheduling
1.2 Thermal Management
1.3 Routing for 3D Networks
Chapter 2 Mapping and Scheduling
2.1 Introduction
2.2 Motivation
2.3 Background
2.4 Related Work
2.5 Platform Description
2.5.1 Architecture Description
2.5.2 Energy Model
2.5.3 Communication Delay Model
2.6 Problem Formulation
2.7 Proposed Solution
2.7.1 Task and Communication Mapping
2.7.2 Communication Type Optimization
2.7.3 Design Space Pruning via Pre-evaluation
2.7.4 Scheduling
2.8 Experimental Results
2.8.1 Experiments with Coarse-grained Iterative Modulo Scheduling
2.8.2 Comparison with Different Mapping Algorithms
2.8.3 Experiments with Overall Algorithms
2.8.4 Experiments with Various Local Memory Sizes
2.8.5 Experiments with Various Placements of Shared Memory
Chapter 3 Thermal Management
3.1 Introduction
3.2 Background
3.2.1 Thermal Modeling
3.2.2 Heterogeneity in Thermal Propagation
3.3 Motivation and Problem Definition
3.4 Related Work
3.5 Orchestrated Voltage-Frequency Assignment
3.5.1 Individual PI Control Method
3.5.2 PI Controlled Weighted-Power Budgeting
3.5.3 Performance/Power Estimation
3.5.4 Frequency Assignment
3.5.5 Algorithm Overview
3.5.6 Stability Conditions for PI Controller
3.6 Experimental Result
3.6.1 Experimental Setup
3.6.2 Overall Algorithm Performance
3.6.3 Accuracy of the Estimation Model
3.6.4 Performance of the Frequency Assignment Algorithm
Chapter 4 Routing for Limited Bandwidth 3D NoC
4.1 Introduction
4.2 Motivation
4.3 Background
4.4 Related Work
4.5 3D Deflection Routing
4.5.1 Serialized TSV Model
4.5.2 TSV Link Injection/ejection Scheme
4.5.3 Deadlock Avoidance
4.5.4 Livelock Avoidance
4.5.5 Router Architecture: Putting It All Together
4.5.6 System Level Consideration
4.6 Experimental Results
4.6.1 Experimental Setup
4.6.2 Results on Synthetic Traffic Patterns
4.6.3 Results on Realistic Traffic Patterns
4.6.4 Results on Real Application Benchmarks
4.6.5 Fairness Issue
4.6.6 Area Cost Comparison
Chapter 5 Routing for Partially Connected 3D NoC
5.1 Introduction
5.2 Background
5.3 Related Work
5.4 Proposed Algorithm
5.4.1 Preliminary
5.4.2 Routing Algorithm for 3-D Stacked Meshes with Regular Partial Vertical Connections
5.4.3 Routing Algorithm for 3-D Stacked Meshes with Irregular Partial Vertical Connections
5.4.4 Extension to Heterogeneous Mesh Layers
5.5 Experimental Results
5.5.1 Experimental Setup
5.5.2 Experiments on Synthetic Traffics
5.5.3 Experiments on Application Benchmarks
5.5.4 Comparison with Reduced Bandwidth Mesh
Chapter 6 Conclusion
Bibliography
Abstract (Korean)
Docto
Optimization of communication intensive applications on HPC networks
Communication is a necessary but overhead-inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of parallel job execution: network topology, routing protocol, suitability of the algorithm being used to the network, job placement, etc. This thesis aims to develop an understanding of how communication plays out on the networks of high performance computing systems and to explore methods that can be used to improve the communication performance of large-scale applications.
Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed.
The primary contribution of this thesis is the development of scalable diagnostic and prediction methods that can assist in the process of network design, adapting applications to future systems, and optimizing execution of applications on existing systems. These methods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet-level simulator (called TraceR), all of which are described in this thesis.
CROSS-LAYER DESIGN, OPTIMIZATION AND PROTOTYPING OF NoCs FOR THE NEXT GENERATION OF HOMOGENEOUS MANY-CORE SYSTEMS
This thesis provides a full set of design methods to enable and manage the
runtime heterogeneity of feature-rich, industry-ready tile-based Network-on-Chips
at different abstraction layers (Architecture Design, Network Assembling,
Testing of NoC, Runtime Operation). The key idea is to maintain
the functionalities of the original layers, and to improve the performance
of architectures by allowing joint optimization and layer coordination. In
general-purpose systems, we address the microarchitectural challenges by co-designing
and co-optimizing feature-rich architectures. In application-specific
NoCs, we emphasize the event notification, so that the platform is continuously
under control. At the network assembly level, this thesis proposes a
Hold Time Robustness technique, to tackle the hold time issue in synchronous
NoCs. At the network architectural level, the choice of a suitable synchronization
paradigm requires a boost of synthesis flow as well as the coexistence
with the DVFS. On one hand this implies the coexistence of mesochronous
synchronizers in the network with dual-clock FIFOs at network boundaries.
On the other hand, dual-clock FIFOs may be placed across inter-switch links
hence removing the need for mesochronous synchronizers. This thesis will
study the implications of the above approaches both on the design flow and
on the performance and power quality metrics of the network. Once the many-core
system is composed together, the issue of testing it arises. This thesis
takes on this challenge and engineers various testing infrastructures. At the
upper abstraction layer, the thesis addresses the issue of managing the fully
operational system and proposes a congestion management technique named
HACS. Moreover, some of the ideas of this thesis will undergo an FPGA
prototyping. Finally, we provide some features for emerging technology by
characterizing the power consumption of Optical NoC Interfaces.