60 research outputs found
A Framework for Efficient Routing in MANET using Index Routing Tables-based Algorithms
Conventional network routing protocols rely on predefined numerical network unique node ID or group identifier for packet delivery, independent of semantic applications. This compels incorporation of resource/service discovery approaches in the design itself, at higher layers of network, causing additional overhead. This overhead, though tolerable in high speed wired networks, significantly restricts the performance in the infrastructure-less wireless ad hoc networks expending their limited battery resources, which are already consumed due to assigning unique identifiers to the naturally anonymous and high mobile nodes. This study proposes a single routing approach which facilitates descriptive and semantically-rich identification of networkโs resources/services. This fusion of the discovery processes of the resources and the path based on their similarity in a single phase significantly reduces traffic load and latency of communication considering the generality too. Further, a framework capable of exploiting application-specific semantics of messages, adaptable to diverse traffic patterns is proposed. Analytical results amply illustrate the scalability and efficacy of the proposed method
FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short
We introduce FatPaths: a simple, generic, and robust routing architecture
that enables state-of-the-art low-diameter topologies such as Slim Fly to
achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC
supercomputers as well as cloud data centers and clusters. FatPaths exposes and
exploits the rich ("fat") diversity of both minimal and non-minimal paths for
high-performance multi-pathing. Moreover, FatPaths uses a redesigned "purified"
transport layer that removes virtually all TCP performance issues (e.g., the
slow start), and incorporates flowlet switching, a technique used to prevent
packet reordering in TCP networks, to enable very simple and effective load
balancing. Our design enables recent low-diameter topologies to outperform
powerful Clos designs, achieving 15% higher net throughput at 2x lower latency
for comparable cost. FatPaths will significantly accelerate Ethernet clusters
that form more than 50% of the Top500 list and it may become a standard routing
scheme for modern topologies
On Energy Efficient Computing Platforms
In accordance with the Moore's law, the increasing number of on-chip integrated transistors has enabled modern computing platforms with not only higher processing power but also more affordable prices. As a result, these platforms, including portable devices, work stations and data centres, are becoming an inevitable part of the human society. However, with the demand for portability and raising cost of power, energy efficiency has emerged to be a major concern for modern computing platforms.
As the complexity of on-chip systems increases, Network-on-Chip (NoC) has been proved as an efficient communication architecture which can further improve system performances and scalability while reducing the design cost. Therefore, in this thesis, we study and propose energy optimization approaches based on NoC architecture, with special focuses on the following aspects.
As the architectural trend of future computing platforms, 3D systems have many bene ts including higher integration density, smaller footprint, heterogeneous integration, etc. Moreover, 3D technology can signi cantly improve the network communication and effectively avoid long wirings, and therefore, provide higher system performance and energy efficiency.
With the dynamic nature of on-chip communication in large scale NoC based systems, run-time system optimization is of crucial importance in order to achieve higher system reliability and essentially energy efficiency. In this thesis, we propose an agent based system design approach where agents are on-chip components which monitor and control system parameters such as supply voltage, operating frequency, etc. With this approach, we have analysed the implementation alternatives for dynamic voltage and frequency scaling and power gating techniques at different granularity, which reduce both dynamic and leakage energy consumption.
Topologies, being one of the key factors for NoCs, are also explored for energy saving purpose. A Honeycomb NoC architecture is proposed in this thesis with turn-model based deadlock-free routing algorithms. Our analysis and simulation based evaluation show that Honeycomb NoCs outperform their Mesh based counterparts in terms of network cost, system performance as well as energy efficiency.Siirretty Doriast
Application-aware deadlock-free oblivious routing
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 67-71).Systems that can be integrated on a single silicon die have become larger and increasingly complex, and wire designs as communication mechanisms for these systems on chip (SoC) have shown to be a limiting factor in their performance. As an approach to remove the limitation of communication and to overcome wire delays, interconnection networks or Network-on-Chip (NoC) architectures have emerged. NoC architectures enable faster data communication between components and are more scalable. In designing NoC systems, there are three key issues; the topology, which directly depends on packaging technology and manufacturing costs, dictates the throughput and latency bounds of the network; the flit control protocol, which establishes how the network resources are allocated to packets exchanged between components; and finally, the routing algorithm, which aims at optimizing network performance for some topology and flow control protocol by selecting appropriate paths for those packets. Since the routing algorithm sits on top of the other layers of design, it is critical that routing is done in a matter that makes good usage of the resources of the network. Two main approaches to routing, oblivious and adaptive, have been followed in creating routing algorithms for these systems. Each approach has its pros and cons; oblivious routing, as opposite to adaptive routing, uses no network state information in determining routes at the cost of lower performance on certain applications, but has been widely used because of its simpler hardware requirements.(cont.) This thesis examines oblivious routing schemes for NoC architectures. It introduces various non-minimal, oblivious routing algorithms that globally allocate network bandwidth for a given application when estimated bandwidths for data transfers are provided, while ensuring deadlock freedom with no significant additional hardware. The work presents and evaluates these oblivious routing algorithms which attempt to minimize the maximum channel load (MCL) across all network links in an effort to maximize application throughput. Simulation results from popular synthetic benchmarks and concrete applications, such as an H.264 decoder, show that it is possible to achieve better performance than traditional deterministic and oblivious routing schemes.by Michel A. Kinsy.S.M
Recommended from our members
Design and Optimization of Networks-on-Chip for Future Heterogeneous Systems-on-Chip
Due to the tight power budget and reduced time-to-market, Systems-on-Chip (SoC) have emerged as a power-efficient solution that provides the functionality required by target applications in embedded systems. To support a diverse set of applications such as real-time video/audio processing and sensor signal processing, SoCs consist of multiple heterogeneous components, such as software processors, digital signal processors, and application-specific hardware accelerators. These components offer different flexibility, power, and performance values so that SoCs can be designed by mix-and-matching them.
With the increased amount of heterogeneous cores, however, the traditional interconnects in an SoC exhibit excessive power dissipation and poor performance scalability. As an alternative, Networks-on-Chip (NoC) have been proposed. NoCs provide modularity at design-time because
communications among the cores are isolated from their computations via standard interfaces. NoCs also exploit communication parallelism at run-time because multiple data can be transferred simultaneously.
In order to construct an efficient NoC, the communication behaviors of various heterogeneous components in an SoC must be considered with the large amount of NoC design parameters. Therefore, providing an efficient NoC design and optimization framework is critical to reduce the design
cycle and address the complexity of future heterogeneous SoCs. This is the thesis of my dissertation.
Some existing design automation tools for NoCs support very limited degrees of automation that cannot satisfy the requirements of future heterogeneous SoCs. First, these tools only support a limited number of NoC design parameters. Second, they do not provide an integrated environment for software-hardware co-development.
Thus, I propose FINDNOC, an integrated framework for the generation, optimization, and validation of NoCs for future heterogeneous SoCs. The proposed framework supports software-hardware co-development, incremental NoC design-decision model, SystemC-based NoC customization and generation, and fast system protyping with FPGA emulations.
Virtual channels (VC) and multiple physical (MP) networks are the two main alternative methods to provide better performance, support quality-of-service, and avoid protocol deadlocks in packet-switched NoC design. To examine the effect of using VCs and MPs with other NoC architectural
parameters, I completed a comprehensive comparative analysis that combines an analytical model, synthesis-based designs for both FPGAs and standard-cell libraries, and system-level simulations.
Based on the results of this analysis, I developed VENTTI, a design and simulation environment that combines a virtual platform (VP), a NoC synthesis tool, and four NoC models characterized at different abstraction levels. VENTTI facilitates an incremental decision-making process with four
NoC abstraction models associated with different NoC parameters. The selected NoC parameters can be validated by running simulations with the corresponding model instantiated in the VP.
I augmented this framework to complete FINDNOC by implementing ICON, a NoC generation and customization tool that dynamically combines and customizes synthesizable SystemC components from a predesigned library. Thanks to its flexibility and automatic network interface generation
capabilities, ICON can generate a rich variety of NoCs that can be then integrated into any Embedded Scalable Platform (ESP) architectures for fast prototying with FPGA emulations.
I designed FINDNOC in a modular way that makes it easy to augmenting it with new capabilities. This, combined with the continuous progress of the ESP design methodology, will provide a seamless SoC integration framework, where the hardware accelerators, software applications, and
NoCs can be designed, validated, and integrated simultaneously, in order to reduce the design cycle of future SoC platforms
Optimization of communication intensive applications on HPC networks
Communication is a necessary but overhead inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve communication performance of large scale applications.
Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed.
The primary contribution of this thesis is development of scalable diagnostic and prediction methods that can assist in the process of network designing, adapting applications to future systems, and optimizing execution of applications on existing systems. These meth- ods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet level simulator (called TraceR), all of which are described in this thesis
Cycle-accurate multicore performance models on FPGAs
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 159-165).The goal of this project is to improve computer architecture by accelerating cycle-accurate performance modeling of multicore processors using FPGAs. Contributions include a distributed technique controlling simulation on a highly-parallel substrate, hardware design techniques to reduce development effort, and a specific framework for modeling shared-memory multicore processors paired with realistic On-Chip Networks.by Michael Pellauer.Ph.D
Hi-Rise: A high-radix switch for 3D integration with single-cycle arbitration
Abstract-This paper proposes a novel 3D switch, called 'HiRise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer to layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric, and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least recently granted arbitration. The 3D switch is evaluated for different radices, number of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit width, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% energy per transaction reduction
์๋น์ค ๊ท ๋ฑ ๋ถ๋ฐฐ์ ๊ณ ์ฑ๋ฅ์ ์ํ ๋ค์คํ๋ก์ธ์์นฉ ์์ ์ฌ๊ตฌ์ฑํ ํต์ ๊ตฌ์กฐ
ํ์๋
ผ๋ฌธ (๋ฐ์ฌ)-- ์์ธ๋ํ๊ต ๋ํ์ : ์ ๊ธฐยท์ปดํจํฐ๊ณตํ๋ถ, 2016. 2. ์ต๊ธฐ์.The chip multiprocessor (CMP) era has long begun due to the diminishing return from instruction-level parallelism (ILP) harvesting techniques, the rising power and temperature from frequency scaling, etc. One powerful processor has been replaced by many less-powerful processors forming a CMP. One of the issues arose from this paradigm shift is the management of communication among the processors. Buses, which has been a common choice for the systems with one or several processors, failed to sustain the increased communication burden of CMPs. Many bus-based improvements including hierarchical buses and bus-matrices, were proposed but eventually, network-on-chip (NoC) has become the de facto standard for designing a CMP system, replacing the bus-based techniques.
NoCs strengths over bus mainly come from its capability of conveying multiple transactions simultaneously from different components to the others. The concurrent communications between the cores are conducted by the distributed, yet shared network components, routers. Routers provide cores with services such as bandwidths. One of the design issues in implementing NoC is to distribute these services evenly across all the cores requesting for them. Arbiter is a component that regulates the accesses to shared resources such as channels and buffers. It has the policy under which requests get services in turn from the shared resources so that the requestors dont fall into deadlock or starvation. One of the common policies for an arbiter is the round-robin, where requests get their grant one by one so that fairness is assured among the requestors. When applied to routers in NoC, it fails to provide the fairness because each request goes through multiple routers, thus multiple round-robin arbiters on a transaction route. The cascaded effect of the round-robin arbitration is that the farther a source is from the destination, the less service it gets from the destination. The first part of this thesis addresses this issue, and proposes thus far the simplest yet the most effective way of providing the fairness to all the nodes on NoC. It applies weighted round-robin scheme where the weights are determined at run-time depending on which cores are allocated to applications or threads running on the CMP. RTL implementation and synthesis are done to show the simplicity of the proposed scheme. Simulation with synthetic traffic patterns and SPEC CPU2006 benchmark applications show that the proposed approach results in outstanding equality-of-service characteristics.
The second part of this thesis deals with the impact of the reconfigurable communication architecture on the performance of a CMP system. One of the pitfalls of NoC is long access latency due to increased hop count between a source and its destination. For example, NoC with mesh topology has its hop count proportional to its size. Because of this, while being a common choice for CMP, mesh topology is said to be inscalable in terms of the number of cores. Some alternatives to mesh topology exist, one of them being high radix NoCs. They replace short and wide channels of mesh with long and narrow ones achieving fewer hop counts. Another option is to cluster cores so that the dimension of mesh network reduces. The clusters are formed by grouping cores via local communication fabric. The clusters are interconnected by a global communication fabric, often in the shape of mesh topology. Many types of local communication fabric are explored in previous researches, including another NoC with topologies of mesh, ring, etc. However, bus has become one of the most favorable choices for the local connection because of its simplicity. The simplicity leads local communications to be performed with high performance, low chip area, low power consumption, etc. One of the issues in forming core clusters in CMP is their grain size. Tying too many cores into a cluster results in the congestion on the bus, reducing the performance of the local communications. On the other hand, too few cores in a cluster misses the chances of improving system performance by efficient local communications through the bus. It is obvious that the optimal number of cores in a cluster depends on the applications that run on the CMP. Bus reconfiguration with bus segments and switches can be a solution for varying cluster size on a CMP. In addition to the variable cluster sizes, bus reconfiguration has another advantage of processor (not process) migration. Bus reconfiguration can reconnect cores and caches so that the distance between cores and data are reduced dynamically. In this way, data copies and network transactions can be dramatically reduced to improve the system performance. The second part of this thesis addresses this issue and proposes a reconfigurable bus-mesh architecture to accelerate pipelined applications. With the proposed architecture, the data transfer between the successive pipeline stages are done not by data copies but by processor migrations. Systematic management of bus segments and L1 data caches are required to achieve efficient use of the reconfigurability. The proposed architecture is compared with the baseline architecture, which maintains cache coherence with hardware. Multilayer perceptron (MLP), convolutional neural network (CNN), and JPEG decoder are implemented as example pipelined applications using multi-threaded programming model. The in-house full system simulator is implemented and used to measure the performance improvement of the proposed architecture. The experimental results show that 21.75 %, 14.40 %, and 12.74 % execution cycle reductions are achieved for MLP, CNN, and JPEG decoder, respectively.Part I Adaptively Weighted Round-Robin Arbitration for Equality of Service in a Many-Core Network-on-Chip [1] 1
Chapter 1 Introduction 3
Chapter 2 Previous Work 7
Chapter 3 Position-Based Weighted Round-Robin Arbitration 11
Chapter 4 Adaptively Weighted Round-Robin Arbitration 17
4.1 Hardware Implementation for weight update 18
4.2 Arbitration Weight Determination 22
Chapter 5 Experimental Results 25
5.1 Open-Loop Measurements 25
5.2 Closed-Loop Measurements 29
5.3 Hardware Implementation 33
Chapter 6 Conclusion 35
Part II Accelerating Pipelined Applications with Reconfigurable Bus-Mesh Communication Architecture in Chip Multiprocessors 37
Chapter 7 Introduction 39
Chapter 8 Backgrounds and Previous Work 43
8.1 Segmented Bus 43
8.2 CMPs with Reconfigurable Bus-Mesh Communication Architecture 44
8.3 Near-Threshold Computing 48
Chapter 9 Baseline Architecture 51
Chapter 10 Motivation 55
Chapter 11 Reconfigurable Bus-Mesh Architecture 61
11.1 Thread Programming Model 61
11.2 Cluster Size 64
11.3 Organizing Multiple L1Ds and SPM Banks in a Cluster 66
11.4 L1 Data Cache / SPM Partitioning 70
11.5 Reconfiguration Overheads 71
Chapter 12 Experimental Results 75
12.1 Pipelined Applications 75
12.2 Simulation Environment 78
12.3 Memory Operations Latency Breakdown 79
Chapter 13 Conclusion 85
Bibliography 87
๊ตญ๋ฌธ์ด๋ก 95Docto
Aspects of practical implementations of PRAM algorithms
The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme crn achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we presented the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimcnsional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable algorithms for the following three commonly required types of computation; balanced tree computations. Fast Fourier Transforms and matrix multiplications
- โฆ