28 research outputs found

    Mapping applications with collectives over sub-communicators on torus networks

    The placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, this task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing between messages and contention. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth, which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface to create a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations such as splits, tilts, or shifts, that can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators.
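The cube-versus-plane-versus-line argument in this abstract can be illustrated with a rough count. The sketch below is not Rubik's interface; `internal_links` and the three shapes are hypothetical names chosen for illustration. It assumes 64 tasks and counts only the nearest-neighbour links inside each block, ignoring torus wraparound, as a crude proxy for how many links a collective confined to that block could use.

```python
def internal_links(dims):
    """Count nearest-neighbour links inside an a x b x c block of nodes.

    Assumption: mesh links only; torus wraparound links are ignored.
    """
    total = 0
    for axis, extent in enumerate(dims):
        # Number of node rows running along this axis.
        others = 1
        for j, d in enumerate(dims):
            if j != axis:
                others *= d
        # Each row of `extent` nodes contributes extent - 1 links.
        total += (extent - 1) * others
    return total


# Three ways to lay out the same 64 tasks (hypothetical shapes).
shapes = {"line": (64, 1, 1), "plane": (8, 8, 1), "cube": (4, 4, 4)}
link_counts = {name: internal_links(d) for name, d in shapes.items()}

for name, n in link_counts.items():
    print(name, n)
```

Under these assumptions the line has 63 internal links, the plane 112, and the cube 144, which is consistent with the abstract's point that more compact shapes expose more paths even if some hops get longer.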

    Performance Analysis of Different Interconnect Networks for Network on Chip

    Nowadays, every electronic system, from a small mobile phone to a satellite sent into space, contains a System-on-Chip (SoC). SoCs have evolved rapidly and are still progressing at a swift pace. Owing to the explosive growth of the semiconductor industry, devices are scaling down quickly, so today's SoCs have become communication-centric, and shared-bus and crossbar systems fail to handle the communication inside the SoC. Interconnection networks offer an alternative solution to this communication problem and are becoming pervasive in SoCs. A NoC-based interconnect network makes well-organized and efficient use of limited communication channels while maintaining low packet latency, high saturation throughput, and high communication bandwidth among different IP cores, with minimum area and low power dissipation. In this thesis we present a detailed performance analysis of four interconnect networks (mesh, torus, fat tree, and butterfly) in terms of latency and throughput under uniform, tornado, neighbour, bit reversal, and bit complement traffic, using a cycle-accurate simulator. We also implement the NoC interconnect networks on an FPGA, observe the effect of the NoC parameters (FDW, FBD, VC) on the FPGA, and validate their performance through FPGA synthesis. We found that the FDW and buffer depth have the greatest effect on FPGA resources, and that Virtual Channels (VCs), together with all other NoC parameters, have a considerable effect on buffer size and on routing and logic requirements in the NoC. We also analyse all interconnect networks in terms of power and area at 65 nm technology using Synopsys tools. We found that the butterfly network is the most power- and area-efficient interconnect network, but its performance degrades heavily at high load, so the fat tree is the most efficient of the networks considered.
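The synthetic traffic patterns named in this abstract each map a source node to a destination node. The sketch below uses the definitions commonly given in interconnection-network textbooks; the thesis may use different variants, and the function names and parameters (`bits` for the address width, `k` for the number of nodes along one dimension) are assumptions made for illustration.

```python
def bit_complement(src, bits):
    """Destination is the bitwise complement of the source address."""
    return ~src & ((1 << bits) - 1)


def bit_reversal(src, bits):
    """Destination is the source address with its bit order reversed."""
    out = 0
    for i in range(bits):
        if src & (1 << i):
            out |= 1 << (bits - 1 - i)
    return out


def neighbour(src, k):
    """Destination is the next node along a ring of k nodes."""
    return (src + 1) % k


def tornado(src, k):
    """Destination is ceil(k/2) - 1 positions ahead on a ring of k nodes."""
    return (src + (k + 1) // 2 - 1) % k


if __name__ == "__main__":
    # 16-node address space (4 bits) and an 8-node ring, both hypothetical.
    for s in range(4):
        print(s, bit_complement(s, 4), bit_reversal(s, 4),
              neighbour(s, 8), tornado(s, 8))
```

Uniform traffic is omitted because it simply draws each destination at random. The other patterns are deterministic, which is what makes them useful for stressing specific links in a topology.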

    Energy consumption in networks on chip : efficiency and scaling

    Computer architecture design is in a new era where performance is increased by replicating processing cores on a chip rather than making CPUs larger and faster. This design strategy is motivated by the superior energy efficiency of the multi-core architecture compared to the traditional monolithic CPU. If the trend continues as expected, the number of cores on a chip is predicted to grow exponentially over time as the density of transistors on a die increases. A major challenge to the efficiency of multi-core chips is the energy used for communication among cores over a Network on Chip (NoC). As the number of cores increases, this energy also increases, imposing serious constraints on design and performance of both applications and architectures. Therefore, understanding the impact of different design choices on NoC power and energy consumption is crucial to the success of the multi- and many-core designs. This dissertation proposes methods for modeling and optimizing energy consumption in multi- and many-core chips, with special focus on the energy used for communication on the NoC. We present a number of tools and models to optimize energy consumption and model its scaling behavior as the number of cores increases. We use synthetic traffic patterns and full system simulations to test and validate our methods. Finally, we take a step back and look at the evolution of computer hardware in the last 40 years and, using a scaling theory from biology, present a predictive theory for power-performance scaling in microprocessor systems

    Studies on Core-Based Testing of System-on-Chips Using Functional Bus and Network-on-Chip Interconnects

    Testing a complex system such as a microprocessor-based system-on-chip (SoC) or a network-on-chip (NoC) is difficult and expensive. In this thesis, we propose three core-based test methods that reuse the existing functional interconnects (a flat bus, the hierarchical buses of multiprocessor SoCs (MPSoCs), and a NoC) in order to avoid the silicon area cost of a dedicated test access mechanism (TAM). However, the use of functional interconnects as functional TAMs introduces several new problems. First, during tests, the interconnects, including the bus arbiter, the bus bridges, and the NoC routers, operate in the functional mode to transport the test stimuli and responses, while the cores under test (CUTs) operate in the test mode. Second, the test data is transported to the CUT through the functional bus, and not directly to the test port. Therefore, special core test wrappers that can provide the necessary control signals required by the different functional interconnects are proposed. We developed two types of wrappers: one buffer-based wrapper for the bus-based systems and a pair of complementary wrappers for the NoC-based systems. Using the core test wrappers, we propose test scheduling schemes for the three functionally different types of interconnects. The test scheduling scheme for a flat bus is developed based on an efficient packet scheduling scheme that minimizes both the buffer sizes and the test time under a power constraint. The scheduling scheme is then extended to take advantage of the hierarchical bus architecture of the MPSoC systems. The third test scheduling scheme, based on bandwidth sharing, is developed specifically for the NoC-based systems. The test scheduling is performed with the objective of co-optimizing the wrapper area cost and the resulting test application time using the two complementary NoC wrappers. For each of the proposed methodologies for the three types of SoC architecture, we conducted a thorough experimental evaluation in order to verify its effectiveness compared to other methods.

    Parallel algorithms for the solution of large sparse inequality systems on distributed memory architectures

    Ankara: Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1998. Thesis (Master's) -- Bilkent University, 1998. Includes bibliographical references leaves 101-104. Turna, Esma. M.S.

    Advances in Time-Domain Electromagnetic Simulation Capabilities Through the Use of Overset Grids and Massively Parallel Computing

    A new methodology is presented for conducting numerical simulations of electromagnetic scattering and wave propagation phenomena. Technologies from several scientific disciplines, including computational fluid dynamics, computational electromagnetics, and parallel computing, are uniquely combined to form a simulation capability that is both versatile and practical. In the process of creating this capability, work is accomplished to conduct the first study designed to quantify the effects of domain decomposition on the performance of a class of explicit hyperbolic partial differential equations solvers; to develop a new method of partitioning computational domains comprised of overset grids; and to provide the first detailed assessment of the applicability of overset grids to the field of computational electromagnetics. Furthermore, the first Finite Volume Time Domain (FVTD) algorithm capable of utilizing overset grids on massively parallel computing platforms is developed and implemented. Results are presented for a number of scattering and wave propagation simulations conducted using this algorithm, including two spheres in close proximity and a finned missile

    Design and Optimization in Near-term Quantum Computation

    Quantum computers have come a long way since conception, and there is still a long way to go before the dream of universal, fault-tolerant computation is realized. In the near term, quantum computers will occupy a middle ground that is popularly known as the “Noisy, Intermediate-Scale Quantum” (or NISQ) regime. The NISQ era represents a transition in the nature of quantum devices from experimental to computational. There is significant interest in engineering NISQ devices and NISQ algorithms in a manner that will guide the development of quantum computation in this regime and into the era of fault-tolerant quantum computing. In this thesis, we study two aspects of near-term quantum computation. The first of these is the design of device architectures, covered in Chapters 2, 3, and 4. We examine different qubit connectivities on the basis of their graph properties, and present numerical and analytical results on the speed at which large entangled states can be created on nearest-neighbor grids and graphs with modular structure. Next, we discuss the problem of permuting qubits among the nodes of the connectivity graph using only local operations, also known as routing. Using a fast quantum primitive to reverse the qubits in a chain, we construct a hybrid, quantum/classical routing algorithm on the chain. We show via rigorous bounds that this approach is faster than any SWAP-based algorithm for the same problem. The second part, which spans the final three chapters, discusses variational algorithms, which are a class of algorithms particularly suited to near-term quantum computation. Two prototypical variational algorithms, quantum adiabatic optimization (QAO) and the quantum approximate optimization algorithm (QAOA), are studied for the difference in their control strategies. We show that on certain crafted problem instances, bang-bang control (QAOA) can be as much as exponentially faster than quasistatic control (QAO). 
Next, we demonstrate the performance of variational state preparation on an analog quantum simulator based on trapped ions. We show that using classical heuristics that exploit structure in the variational parameter landscape, one can find circuit parameters efficiently in system size as well as circuit depth. In the experiment, we approximate the ground state of a critical Ising model with long-ranged interactions on up to 40 spins. Finally, we study the performance of Local Tensor, a classical heuristic algorithm inspired by QAOA on benchmarking instances of the MaxCut problem, and suggest physically motivated choices for the algorithm hyperparameters that are found to perform well empirically. We also show that our implementation of Local Tensor mimics imaginary-time quantum evolution under the problem Hamiltonian

    On Permuting Ability of a 2D Torus under XY Routing

    This paper explores the permuting ability of a 2D torus under deterministic XY routing. The study covers four communication models: unidirectional uniaxial, bidirectional uniaxial, unidirectional biaxial, and bidirectional biaxial. Necessary and sufficient conditions for blocking in a 2D torus under the uniaxial models are expressed mathematically using the notion of congruence from number theory. Some examples of applying the technique are given. The admissibility of frequently used permutations to 2D tori of various sizes under XY routing is checked computationally for these communication models. A simple way of realizing the perfect shuffle and bit reversal permutations on a 2D torus under XY routing is found.
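Deterministic XY routing on a 2D torus can be sketched as follows: a message first travels along the X ring until its column matches the destination, then along the Y ring. This is a minimal illustration, not the paper's formulation. The function names, the `bidirectional` flag, and the tie-breaking rule (go in the positive direction when both ways are equally long) are assumptions, and the flag is not claimed to correspond exactly to the paper's communication models. Two messages are treated as blocking each other here if their paths share a directed link.

```python
def ring_step(cur, dst, k, bidirectional=True):
    """Return +1 or -1, the direction to move along a ring of k nodes."""
    forward = (dst - cur) % k
    # Assumption: ties between the two directions go to the positive one.
    if not bidirectional or forward <= k - forward:
        return 1
    return -1


def xy_route(src, dst, cols, rows, bidirectional=True):
    """List the nodes visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x = (x + ring_step(x, dst[0], cols, bidirectional)) % cols
        path.append((x, y))
    while y != dst[1]:
        y = (y + ring_step(y, dst[1], rows, bidirectional)) % rows
        path.append((x, y))
    return path


def links(path):
    """Directed links (node, next_node) used by a path."""
    return set(zip(path, path[1:]))


def conflict(path_a, path_b):
    """True if the two paths share at least one directed link."""
    return bool(links(path_a) & links(path_b))


if __name__ == "__main__":
    a = xy_route((0, 0), (2, 0), 4, 4)
    b = xy_route((1, 0), (2, 1), 4, 4)
    print(a, b, conflict(a, b))
```

On a 4 x 4 torus, the route from (0, 0) to (3, 1) uses the wraparound link when both directions are allowed and takes the long way round when only the positive direction is allowed, which shows how the choice of communication model changes which links a permutation occupies.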

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany
