
    Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not use this important resource efficiently. Topology-aware routing algorithms are becoming increasingly inapplicable due to irregular topologies, which either are irregular by design or, most often, result from hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of a failure. This management approach becomes more and more impractical given the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, in terms of both hardware and software management, are necessary to mitigate the negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. The fail-in-place strategy, a well-established method in storage systems of repairing only critical component failures, is a feasible solution for current and future HPC interconnects as well as for other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately as all previous solutions do
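
    To make the central mechanism concrete, the following is a minimal Python sketch, not the thesis implementation: each network channel becomes a vertex of the channel dependency graph (CDG), every consecutive channel pair of a route adds a dependency edge, and a candidate route is admitted only if the CDG stays acyclic, the classical condition for deadlock freedom. The names and the dictionary-based graph representation are illustrative assumptions.

def has_cycle(cdg):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in cdg}

    def dfs(v):
        color[v] = GRAY
        for w in cdg.get(v, ()):
            if color.get(w, WHITE) == GRAY:   # back edge closes a cycle
                return True
            if color.get(w, WHITE) == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in list(cdg))

def try_add_route(cdg, route_channels):
    """Add the dependencies of a candidate route; reject and roll back the
    route if it would make the CDG cyclic (i.e., introduce a deadlock)."""
    added = []
    for a, b in zip(route_channels, route_channels[1:]):
        cdg.setdefault(a, set())
        cdg.setdefault(b, set())
        if b not in cdg[a]:
            cdg[a].add(b)
            added.append((a, b))
    if has_cycle(cdg):
        for a, b in added:
            cdg[a].discard(b)   # roll back: keep the CDG acyclic
        return False
    return True

cdg = {}
print(try_add_route(cdg, ["c0", "c1", "c2"]))  # True: CDG stays acyclic
print(try_add_route(cdg, ["c2", "c0"]))        # False: would close c0->c1->c2->c0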

    An improved non-local awareness of congestion and load-balanced algorithm for communication in on-chip 2D mesh-based networks

    Due to advancements in multi-core design technology, IC (integrated circuit) designers have expanded single-chip multi-core designs. An effective way for these multiple cores to communicate is a Network-on-Chip (NoC). Designing a routing algorithm capable of steering data onto non-congested paths, by retrieving congestion information from non-local nodes, is one of the most notable research challenges in NoC. This research proposes an improved congestion-aware, load-balancing routing algorithm. Awareness of congestion on non-local or distant links is achieved by propagating congestion information via data packets. By counting the number of hops from the source node, an intermediate node is defined in the quadrant of the destination node; after the least congested route to this intermediate node is calculated, the route is stored in the data packet for source routing. Furthermore, for load balancing, the network is partitioned into two areas, a highly congested area (HCA) and a low-congestion area (LCA), and for packets in the HCA a node in the LCA is selected as the output. The proposed algorithm is compared in terms of average latency, average throughput, power consumption, and scalability under synthetic traffic patterns. Simulation experiments show that the proposed algorithm improves average latency and throughput by 31.28% and 5.28%, respectively, over existing approaches
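
    As a hedged illustration of the described two-phase scheme on a 2D mesh, the sketch below picks an intermediate node a few hops away in the destination's quadrant and then chooses the less congested of the two minimal (XY/YX) routes toward it, storing the result for source routing. The hop count, the per-node congestion metric, and the restriction to two candidate paths are simplifying assumptions, not the paper's exact algorithm.

def sign(x):
    return (x > 0) - (x < 0)

def intermediate_node(src, dst, hops=2):
    """Move `hops` steps from src into dst's quadrant, alternating dimensions."""
    x, y = src
    for i in range(hops):
        if i % 2 == 0 and x != dst[0]:
            x += sign(dst[0] - x)
        elif y != dst[1]:
            y += sign(dst[1] - y)
        elif x != dst[0]:
            x += sign(dst[0] - x)
    return (x, y)

def xy_path(src, dst):
    """Minimal dimension-ordered path: first along X, then along Y."""
    path, (x, y) = [src], src
    while x != dst[0]:
        x += sign(dst[0] - x)
        path.append((x, y))
    while y != dst[1]:
        y += sign(dst[1] - y)
        path.append((x, y))
    return path

def yx_path(src, dst):
    """Minimal path in the opposite dimension order: Y first, then X."""
    path, (x, y) = [src], src
    while y != dst[1]:
        y += sign(dst[1] - y)
        path.append((x, y))
    while x != dst[0]:
        x += sign(dst[0] - x)
        path.append((x, y))
    return path

def least_congested_route(src, dst, congestion, hops=2):
    """Pick an intermediate node in dst's quadrant, then the less congested
    of the two minimal routes to it; the chosen route travels in the packet
    header for source routing, as the abstract describes."""
    mid = intermediate_node(src, dst, hops)
    candidates = [xy_path(src, mid), yx_path(src, mid)]
    return min(candidates, key=lambda p: sum(congestion.get(n, 0) for n in p))

congestion = {(1, 0): 9, (0, 1): 1}   # per-node load learned from packets
print(least_congested_route((0, 0), (3, 3), congestion))
# -> [(0, 0), (0, 1), (1, 1)]: avoids the congested node (1, 0)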

    Graph-based Heuristic Solution for Placing Distributed Video Processing Applications on Moving Vehicle Clusters

    Vehicular fog computing (VFC) is envisioned as an extension of cloud and mobile edge computing that utilizes the rich sensing and processing resources available in vehicles. We focus on slow-moving cars that spend a significant time in urban traffic congestion as a potential pool of onboard sensors, video cameras, and processing capacity. To leverage the dynamic network and processing resources, we utilize a stochastic mobility model to select nodes with similar mobility patterns. We then design two distributed applications that are scaled in real time and placed as multiple instances on selected vehicular fog nodes. We handle the unstable vehicular environment by (a) using real vehicle density data to build a realistic mobility model that helps in selecting nodes for service deployment; (b) using community-detection algorithms to select a robust vehicular cluster based on the predicted mobility behavior of vehicles, with the stability of the chosen cluster validated using a graph centrality measure; and (c) developing graph-based placement heuristics to find the optimal placement of service graphs, formulated as a multi-objective constrained optimization problem with the objective of efficient resource utilization. The heuristic solves an important problem in processing data generated from distributed devices by balancing a trade-off: increasing the number of service instances provides enough redundancy to keep the service resilient to node or link failures, while reducing their number minimizes resource usage. We compare our heuristic to a mixed-integer program (MIP) solution and a first-fit heuristic. Our approach performs better than these comparable schemes in terms of resource utilization and/or achieves lower service latency than an edge computing-based service placement scheme
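
    A hedged sketch of the cluster-selection step is given below: build a connectivity graph over vehicles predicted to stay in contact, group them with a community-detection algorithm, and validate the chosen cluster with a centrality measure. The networkx calls are standard, but the contact model and the stability score are illustrative assumptions rather than the authors' implementation.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def pick_vehicle_cluster(contacts, min_size=3):
    """contacts: iterable of (vehicle_a, vehicle_b, predicted_contact_seconds)."""
    g = nx.Graph()
    g.add_weighted_edges_from(contacts)
    # Community detection groups vehicles with similar predicted mobility.
    communities = greedy_modularity_communities(g, weight="weight")
    best = max((c for c in communities if len(c) >= min_size),
               key=len, default=None)
    if best is None:
        return None, 0.0
    sub = g.subgraph(best)
    # Illustrative stability score: a low maximum betweenness suggests no
    # single vehicle is a fragile cut point for the cluster.
    centrality = nx.betweenness_centrality(sub, weight="weight")
    stability = 1.0 - max(centrality.values(), default=0.0)
    return set(best), stability

contacts = [("v1", "v2", 120), ("v2", "v3", 90), ("v1", "v3", 100),
            ("v4", "v5", 30)]
cluster, stability = pick_vehicle_cluster(contacts)
print(cluster, round(stability, 2))  # {'v1', 'v2', 'v3'} 1.0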

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV
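
    Purely as an illustration of what such a classification might look like, the toy function below buckets a function by simple memory-behavior metrics. The metric names and thresholds are invented for this sketch; DAMOV's actual methodology is a more rigorous, multi-step characterization.

def classify_function(llc_mpki, arithmetic_intensity, bw_utilization):
    """Return a coarse data-movement bottleneck class for one function.
    llc_mpki: last-level-cache misses per kilo-instruction (invented threshold).
    arithmetic_intensity: operations per byte moved from memory.
    bw_utilization: fraction of peak DRAM bandwidth consumed."""
    if llc_mpki < 1.0:
        return "compute-bound: on-chip caches already capture the working set"
    if bw_utilization > 0.8:
        return "bandwidth-bound: candidate for near-data processing (NDP)"
    if arithmetic_intensity < 0.25:
        return "latency-bound: irregular accesses with little data reuse"
    return "mixed: compare compute-centric and memory-centric mitigations"

print(classify_function(llc_mpki=12.0, arithmetic_intensity=0.1,
                        bw_utilization=0.9))
# -> bandwidth-bound: candidate for near-data processing (NDP)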

    Domain-specific Architectures for Data-intensive Applications

    Graphs' versatile ability to represent diverse relationships makes them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results. Medical centers use them to aid in patient diagnosis. Most recently, graphs have also been employed to support the management of viral pandemics. Looking forward, they show promise of being critical in unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all these applications require more computational power than conventional computing systems can provide. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time they access data in seemingly unpredictable patterns and thus cannot benefit from traditional on-chip storage. In this dissertation, we set out to address the performance limitations of existing computing systems so as to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing the memory architecture, 2) processing data near its storage, and 3) coalescing messages in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, an architecture that optimizes the processing of infrequently accessed data. Overall, these solutions provide 2x performance improvements, with negligible hardware overheads, across a wide range of applications. Finally, we demonstrate the applicability of our strategies to other data-intensive domains by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads
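
    The message-coalescing strategy lends itself to a compact sketch: when several in-flight update messages target the same graph vertex, a router can fuse them with the algorithm's reduction operator (min for shortest paths) instead of forwarding each one. The dictionary-based router buffer below is an illustrative model, not the MessageFusion hardware.

def coalesce(buffer, dest_vertex, value, reduce_op=min):
    """Fuse a new update into the router buffer, keyed by destination vertex."""
    if dest_vertex in buffer:
        buffer[dest_vertex] = reduce_op(buffer[dest_vertex], value)  # fused
    else:
        buffer[dest_vertex] = value                                  # queued
    return buffer

buf = {}
for dst, dist in [(7, 12), (7, 9), (3, 4), (7, 15)]:
    coalesce(buf, dst, dist)
print(buf)  # {7: 9, 3: 4}: three messages to vertex 7 became a single one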

    Contention and Reliability-Aware Energy Efficiency Task Mapping on NoC-Based MPSoCs

    Recently, Network-on-Chip (NoC)-based Multi-Processor System-on-Chips (MPSoCs) have become popular computing platforms for real-time applications due to their higher communication performance and energy efficiency compared to traditional bus-based MPSoCs. Due to the nature of network structures, network congestion, along with transient faults, can significantly affect communication efficiency and system reliability. Most existing works have rarely focused on the concurrent optimization of network contention, reliability, and energy consumption. Here, we study the problem of contention- and reliability-aware task mapping under real-time constraints for dynamic voltage and frequency scaling (DVFS)-enabled NoCs. The problem entails optimizing the voltages/frequencies of cores and links to reduce energy consumption and ensure system reliability, while task mapping and slack time are used to alleviate network contention and reduce latency. We aim to minimize computation and communication energy and to balance the workload. The problem is formulated as a mixed-integer nonlinear program, and we present an effective linearization scheme that equivalently transforms it into a mixed-integer linear program to find the optimal solution. To reduce computation time, we propose a three-step heuristic comprising task allocation, frequency scaling and edge scheduling, and communication contention management. Finally, we perform extensive simulations to evaluate the proposed method. The results show that we achieve 31.6% and 21.7% energy savings, with 95.5% and 98.6% less contention, compared to existing methods
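
    As a hedged sketch of the first step of the three-step heuristic (task allocation), the code below greedily assigns each task, in deadline order, to the core and DVFS level that meet its real-time constraint at the lowest dynamic energy, using the textbook model in which energy is proportional to C_eff * V^2 * cycles. The constants and the greedy policy are illustrative simplifications, not the paper's MILP formulation.

V_F_LEVELS = [(0.8, 0.5e9), (1.0, 1.0e9), (1.2, 1.5e9)]  # (volts, hertz)

def allocate(tasks, cores):
    """tasks: {name: (cycles, deadline_s)}; cores: {core: current_load_s}.
    Returns {task: (core, (volts, hertz))}; raises if a deadline is missed."""
    mapping = {}
    for task, (cycles, deadline) in sorted(tasks.items(),
                                           key=lambda kv: kv[1][1]):
        feasible = []
        for core, load in cores.items():
            for volts, hertz in V_F_LEVELS:
                finish = load + cycles / hertz
                if finish <= deadline:            # real-time constraint
                    energy = volts ** 2 * cycles  # E ~ C_eff * V^2 * cycles
                    feasible.append((energy, finish, core, (volts, hertz)))
        if not feasible:
            raise ValueError(f"{task} cannot meet its deadline")
        energy, finish, core, level = min(feasible)  # lowest energy first
        cores[core] += cycles / level[1]             # update the core's load
        mapping[task] = (core, level)
    return mapping

tasks = {"t1": (4e8, 1.0), "t2": (2e8, 0.5)}
print(allocate(tasks, {"c0": 0.0, "c1": 0.0}))
# t2 runs early on c0 at the lowest level; t1 fits on idle c1 at low voltage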

    Efficient Interconnection Network Design for Heterogeneous Architectures

    The onset of big data and deep learning applications, mixed with conventional general-purpose programs, has driven computer architecture to embrace heterogeneity and specialization. With ever more interconnected chip components, future architectures are required to operate under a stricter power budget and to process emerging big data applications efficiently. The interconnection network, as the communication backbone, thus faces the grand challenges of a limited power envelope, data movement, and performance scaling. This dissertation provides interconnect solutions that are specialized to application requirements, targeting power-/energy-efficient and high-performance computing for heterogeneous architectures. The dissertation first examines the challenges of network-on-chip router power-gating techniques for general-purpose workloads, which save static power. A voting approach is proposed as an adaptive power-gating policy that considers both local and global traffic status through router voting. In addition, low-latency routing algorithms are designed to guarantee performance in irregularly power-gated networks. This holistic solution not only saves power but also avoids performance overhead. This research also brings emerging computation paradigms to interconnects for big data applications to mitigate the pressure of data movement. An approximate network-on-chip is proposed to achieve high-throughput communication by means of lossy compression, and near-data processing is combined with in-network computing to further improve performance while reducing data movement. The two schemes are general enough to serve as plug-ins for different network topologies and routing algorithms. To tackle the challenging computational requirements of deep learning workloads, this dissertation investigates the compelling opportunity of communication algorithm-architecture co-design to accelerate distributed deep learning. The MultiTree allreduce algorithm is proposed to bind message scheduling to the network topology and achieve faster, contention-free communication. In addition, the interconnect hardware and flow control are specialized to exploit the communication characteristics of deep learning and fulfill the algorithm's needs, effectively improving performance and scalability. By considering application and algorithm characteristics, this research shows that the interconnection network can be tailored to improve power/energy efficiency and performance, satisfying heterogeneous computation and communication requirements
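
    Tree-based allreduce is the primitive that MultiTree schedules over the topology. The single-tree sketch below shows only the reduce-then-broadcast structure; the actual MultiTree algorithm builds multiple conflict-free trees and overlaps their schedules.

def tree_allreduce(values):
    """Return a list where every node holds the sum of all inputs.
    Assumes len(values) is a power of two, for brevity."""
    n = len(values)
    buf = list(values)
    s = 1
    while s < n:                          # reduce phase: partial sums climb
        for i in range(0, n - s, 2 * s):  # node i receives from node i + s
            buf[i] += buf[i + s]
        s *= 2
    while s >= 1:                         # broadcast phase: total flows down
        for i in range(0, n - s, 2 * s):
            buf[i + s] = buf[i]
        s //= 2
    return buf

print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]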