560 research outputs found

    Virtual lines, a deadlock-free and real-time routing mechanism for ATM networks

    Get PDF
    In this paper, we present a routing mechanism and buffer allocation mechanism for an ATM switching fabric. Since the fabric will be used to transfer multimedia traffic, it should provide a guaranteed throughput and a bounded latency. We focus on the design of a suitable routing mechanism that is capable of fulfilling these requirements and is free of deadlocks. We will describe two basic concepts that can be used to implement deadlock-free routing. Routing of messages is closely related to buffering. We have organized the buffers into parallel FIFO's, each representing a virtual line. In this way, we not only have solved the problem of head of line blocking, but we can also give real-time guarantees. We will show that for local high-speed networks, it is more advantageous to have a proper flow control than to have large buffers. Although the virtual line concept can have a low buffer utilization, the transfer efficiency can be higher. The virtual line concept allows adaptive routing. The total throughput of the network can be improved by using alternative routes. Adaptive routing is attractive in networks where alternative routes are not much longer than the initial route(s). The network of the switching fabric is built up from switching elements interconnected in a Kautz topology

    Virtual lines, a deadlock free and real-time routing mechanism for ATM networks

    Get PDF
    In this paper we present a routing mechanism and buffer allocation mechanism for an ATM switching fabric. Since the fabric will be used to transfer multimedia traffic it should provide a guaranteed throughput and a bounded latency. We focus on the design of a suitable routing mechanism that is capable to fulfil these requirements and is free of deadlocks. We will describe two basic concepts that can be used to implement deadlock free routing. Routing of messages is closely related to buffering. We have organized the buffers into parallel fifos, each representing a virtual line. In this way we not only have solved the problem of Head Of Line blocking, but we can also give real-time guarantees. We will show that for local high-speed networks it is more advantageous to have a proper flow control than to have large buffers. Although the virtual line concept can have a low buffer utilization, the transfer efficiency can be higher. The virtual lines concept allows adaptive routing. The total throughput of the network can be improved by using alternative routes. Adaptive routing is attractive in networks where alternative routes are not much longer than the initial route(s). The network of the switching fabric is built up from switching elements interconnected in a Kautz topology

    Energy reduction in 3D NoCs through communication optimization

    Get PDF
    Cataloged from PDF version of article.Network-on-Chip (NoC) architectures and three-dimensional (3D) integrated circuits have been introduced as attractive options for overcoming the barriers in interconnect scaling while increasing the number of cores. Combining these two approaches is expected to yield better performance and higher scalability. This paper explores the possibility of combining these two techniques in a heterogeneity aware fashion. Specifically, on a heterogeneous 3D NoC architecture, we explore how different types of processors can be optimally placed to minimize data access costs. Moreover, we select the optimal set of links with optimal voltage levels. The experimental results indicate significant savings in energy consumption across a wide range of values of our major simulation parameters

    HARDWARE DESIGN OF MESSAGE PASSING ARCHITECTURE ON HETEROGENEOUS SYSTEM

    Get PDF
    Heterogeneous multi/many-core chips are commonly used in today’s top tier supercomputers. Similar heterogeneous processing elements — or, computation ac- celerators — are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, The common use of the on-chip network is for point-to-point communication between on-chip components and the memory in- terface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip network and the off-chip network, therefore could make communication the performance bottleneck. This research proposes a MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves the communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides direct interface to the heterogeneous processing ele- ments which can eliminate the data path going around the OS and libraries. Detailed experimental results have shown that the MPE can significantly reduce the com- munication time and improve the overall performance, especially for heterogeneous computing systems because of the tight coupling with the network. Additionally, a hybrid “MPI+X” computing system is tested and it shows MPE can effectively of- fload the communications and let the processing elements play their strengths on the computation

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    Get PDF
    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Power Bounded Computing on Current & Emerging HPC Systems

    Get PDF
    Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. We have multiple research objectives in this dissertation. They center on the understanding of the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristic and power budget on a cluster. Third, we investigate performance interference induced by hardware and power contentions, and propose a contention aware job scheduling to maximize system throughput under given power budgets for node sharing system. Fourth, we extend to GPU-accelerated systems and workloads and develop an online dynamic performance & power approach to meet both performance requirement and power efficiency. Power bounded computing improves performance scalability and power efficiency and decreases operation costs of HPC systems and data centers. This dissertation opens up several new ways for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines to green exscale computing and other computing systems

    Automatic synthesis and optimization of chip multiprocessors

    Get PDF
    The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed

    Next-generation acceleration and code optimization for light transport in turbid media using GPUs

    Get PDF
    A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml)
    • 

    corecore