97 research outputs found
Application-Aware Deadlock-Free Oblivious Routing
Conventional oblivious routing algorithms are either not application-aware or assume that each flow has its own private channel to ensure deadlock avoidance. We present a framework for application-aware routing that assures deadlock-freedom under one or more channels by forcing routes to conform to an acyclic channel dependence graph. Arbitrary minimal routes can be made deadlock-free through appropriate static channel allocation when two or more channels are available. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. The heuristic algorithm is calibrated using the MILP algorithm and evaluated on a number of benchmarks through detailed network simulation. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof
Application-Aware Deadlock-Free Oblivious Routing
Conventional oblivious routing algorithms are either not application-aware or assume that each flow has its own private channel to ensure deadlock avoidance. We present a framework for application-aware routing that assures deadlock-freedom under one or more channels by forcing routes to conform to an acyclic channel dependence graph. Arbitrary minimal routes can be made deadlock-free through appropriate static channel allocation when two or more channels are available. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. The heuristic algorithm is calibrated using the MILP algorithm and evaluated on a number of benchmarks through detailed network simulation. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof
Static virtual channel allocation in oblivious routing
Most virtual channel routers have multiple virtual channels to mitigate the effects of head-of-line blocking. When there are more flows than virtual channels at a link, packets or flows must compete for channels, either in a dynamic way at each link or by static assignment computed before transmission starts. In this paper, we present methods that statically allocate channels to flows at each link when oblivious routing is used, and ensure deadlock freedom for arbitrary minimal routes when two or more virtual channels are available. We then experimentally explore the performance trade-offs of static and dynamic virtual channel allocation for various oblivious routing methods, including DOR, ROMM, Valiant and a novel bandwidth-sensitive oblivious routing scheme (BSORM). Through judicious separation of flows, static allocation schemes often exceed the performance of dynamic allocation schemes
Accelerating Communication in On-Chip Interconnection Networks
Due to the ever-shrinking feature size in CMOS process technology, it is expected that future chip multiprocessors (CMPs) will have hundreds or thousands of processing cores. To support a massively large number of cores, packet-switched on-chip interconnection networks have become a de facto communication paradigm in CMPs. However, the on-chip networks have several drawbacks, such as limited on-chip resources, increasing communication latency, and insufficient communication bandwidth.
In this dissertation, several schemes are proposed to accelerate communication in on-chip interconnection networks within area and cost budgets to overcome the problems. First, an early transition scheme for fully adaptive routing algorithms is proposed to improve network throughput. Within a limited number of resources, previously proposed fully adaptive routing algorithms have low utilization in escape channels. To increase utilization of escape channels, it transfers packets earlier before the normal channels are full. Second, a pseudo-circuit scheme is proposed to reduce network latency using communication temporal locality. Reducing per-hop router delay becomes more important for communication latency reduction in larger on-chip interconnection networks. To improve communication latency, the previous arbitration information is reused to bypass switch arbitration. For further acceleration, we also propose two aggressive schemes, pseudo-circuit speculation and buffer bypassing. Third, two handshake schemes are proposed to improve network throughput for nanophotonic interconnects. Nanophotonic interconnects have been proposed to replace metal wires with optical links in on-chip interconnection networks for low latency and power consumptions as well as high bandwidth. To minimize the average token waiting time of the nanophotonic interconnects, the traditional credit-based flow control is removed. Thus, the handshake schemes increase link utilization and enhance network throughput
NoCo: ILP-based worst-case contention estimation for mesh real-time manycores
Manycores are capable of providing the computational demands required by functionally-advanced critical applications in domains such as automotive and avionics. In manycores a network-on-chip (NoC) provides access to shared caches and memories and hence concentrates most of the contention that tasks suffer, with effects on the worst-case contention delay (WCD) of packets and tasks' WCET. While several proposals minimize the impact of individual NoC parameters on WCD, e.g. mapping and routing, there are strong dependences among these NoC parameters. Hence, finding the optimal NoC configurations requires optimizing all parameters simultaneously, which represents a multidimensional optimization problem. In this paper we propose NoCo, a novel approach that combines ILP and stochastic optimization to find NoC configurations in terms of packet routing, application mapping, and arbitration weight allocation. Our results show that NoCo improves other techniques that optimize a subset of NoC parameters.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-
65316-P and the HiPEAC Network of Excellence. It also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (agreement No. 772773). Carles Hernández
is jointly supported by the MINECO and FEDER funds
through grant TIN2014-60404-JIN. Jaume Abella has been
partially supported by the Spanish Ministry of Economy and
Competitiveness under Ramon y Cajal postdoctoral fellowship
number RYC-2013-14717. Enrico Mezzetti has been partially
supported by the Spanish Ministry of Economy and Competitiveness
under Juan de la Cierva-Incorporaci´on postdoctoral
fellowship number IJCI-2016-27396.Peer ReviewedPostprint (author's final draft
Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks
Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment
Application-aware deadlock-free oblivious routing
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 67-71).Systems that can be integrated on a single silicon die have become larger and increasingly complex, and wire designs as communication mechanisms for these systems on chip (SoC) have shown to be a limiting factor in their performance. As an approach to remove the limitation of communication and to overcome wire delays, interconnection networks or Network-on-Chip (NoC) architectures have emerged. NoC architectures enable faster data communication between components and are more scalable. In designing NoC systems, there are three key issues; the topology, which directly depends on packaging technology and manufacturing costs, dictates the throughput and latency bounds of the network; the flit control protocol, which establishes how the network resources are allocated to packets exchanged between components; and finally, the routing algorithm, which aims at optimizing network performance for some topology and flow control protocol by selecting appropriate paths for those packets. Since the routing algorithm sits on top of the other layers of design, it is critical that routing is done in a matter that makes good usage of the resources of the network. Two main approaches to routing, oblivious and adaptive, have been followed in creating routing algorithms for these systems. Each approach has its pros and cons; oblivious routing, as opposite to adaptive routing, uses no network state information in determining routes at the cost of lower performance on certain applications, but has been widely used because of its simpler hardware requirements.(cont.) This thesis examines oblivious routing schemes for NoC architectures. It introduces various non-minimal, oblivious routing algorithms that globally allocate network bandwidth for a given application when estimated bandwidths for data transfers are provided, while ensuring deadlock freedom with no significant additional hardware. The work presents and evaluates these oblivious routing algorithms which attempt to minimize the maximum channel load (MCL) across all network links in an effort to maximize application throughput. Simulation results from popular synthetic benchmarks and concrete applications, such as an H.264 decoder, show that it is possible to achieve better performance than traditional deterministic and oblivious routing schemes.by Michel A. Kinsy.S.M
Recommended from our members
Improving GPGPU NoC Performance with Memory Controller Placement Awareness
General-purpose Graphics Processing Units (GPGPUs) have become a critical component in high-performance computing (HPC) systems in executing modern computational workloads. The high thread level parallelism (TLP) and programmable shader cores allow thousands of threads to execute in Parallel. The fast-scaling of GPGPUs have increased the demand for performance optimizations on Network-On-Chip (NoC) designs. Previous works have exploited NoC designs in the Chip Multiprocessor (CMP) environments but not much in GPGPU systems. Unlike CMPs, traffic is highly asymmetric in GPGPUs because of the many cores but very few memory controllers (MCs). The asymmetric traffic impacts the resource utilization and performance of NoCs. This work aims at analyzing the maximum channel load associated with different MC placements and VC partitioning in interconnection networks of GPGPUs. We also utilize the recently introduced concept of VC monopolization and propose new ways to reduce link contention and improve performance. Evaluation results from cycle-accurate simulation and representative workloads show that the proposed scheme increases application performance (IPC) by 14%, on average, and reduces interconnection network packet latency by 21%
Reliability-aware and energy-efficient system level design for networks-on-chip
2015 Spring.Includes bibliographical references.With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has been recently proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance and reliability. To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF aware partial protection for NoC components, almost 50% energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, that involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration. We propose a novel design-time (RESYN) and a hybrid design and runtime (HEFT) synthesis framework to trade-off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work
- …